Words List (appearance)

# word phonetic sentence
1 trainable [t'reɪnəbl]
  • An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition 基于图像序列识别的端到端可训练神经网络及其在场景文本识别中的应用
  • Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. 与以前的场景文本识别系统相比，所提出的架构具有四个独特的特性：（1）与大多数现有算法的组件需要单独训练和调优不同，它是端到端可训练的。
  • Being robust, rich and trainable, deep convolutional features have been widely adopted for different kinds of visual recognition tasks [25, 12]. 鲁棒的,丰富的和可训练的深度卷积特征已被广泛应用于各种视觉识别任务[25,12]。
  • Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions). 比较的属性包括:1)端到端训练(E2E Train);2)从图像中直接学习卷积特征而不是使用手动设计的特征(Conv Ftrs);3)训练期间不需要字符的实际边界框(CharGT-Free);4)不受限于预定义字典(Unconstrained);5)模型大小(如果使用端到端模型),通过模型参数数量来衡量(Model Size, M表示百万)。
  • E2E Train: This column is to show whether a certain text reading model is end-to-end trainable, without any preprocessing or several separated steps, which indicates such approaches are elegant and clean for training. E2E Train：这一列是为了显示某种文字阅读模型是否可以进行端到端的训练，无需任何预处理或经过几个分离的步骤，这表明这种方法对于训练是优雅且干净的。
2 long-standing [ˈlɔŋstændiŋ]
  • Image-based sequence recognition has been a long-standing research topic in computer vision. 基于图像的序列识别一直是计算机视觉中长期存在的研究课题。
3 extraction [ɪkˈstrækʃn]
  • A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. 提出了一种将特征提取,序列建模和转录整合到统一框架中的新型神经网络架构。
  • 2.1. Feature Sequence Extraction 2.1. 特征序列提取
4 transcription [trænˈskrɪpʃn]
  • A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. 提出了一种将特征提取,序列建模和转录整合到统一框架中的新型神经网络架构。
  • The network architecture of CRNN, as shown in Fig. 1, consists of three components, including the convolutional layers, the recurrent layers, and a transcription layer, from bottom to top. 如图1所示,CRNN的网络架构由三部分组成,包括卷积层,循环层和转录层,从底向上。
  • 2) recurrent layers, which predict a label distribution for each frame; 3) transcription layer, which translates the per-frame predictions into the final label sequence. 2) 循环层,预测每一帧的标签分布;3) 转录层,将每一帧的预测变为最终的标签序列。
  • The transcription layer at the top of CRNN is adopted to translate the per-frame predictions by the recurrent layers into a label sequence. 采用CRNN顶部的转录层将循环层的每帧预测转化为标签序列。
  • 2.3. Transcription 2.3. 转录
  • Transcription is the process of converting the per-frame predictions made by RNN into a label sequence. 转录是将RNN所做的每帧预测转换成标签序列的过程。
  • Mathematically, transcription is to find the label sequence with the highest probability conditioned on the per-frame predictions. 数学上,转录是根据每帧预测找到具有最高概率的标签序列。
  • In practice, there exists two modes of transcription, namely the lexicon-free and lexicon-based transcriptions. 在实践中,存在两种转录模式,即无词典转录和基于词典的转录。
  • 2.3.2 Lexicon-free transcription 2.3.2 无字典转录
  • 2.3.3 Lexicon-based transcription 2.3.3 基于词典的转录
  • However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Equation 1 for all sequences in the lexicon and choose the one with the highest probability. To solve this problem, we observe that the label sequences predicted via lexicon-free transcription, described in 2.3.2, are often close to the ground-truth under the edit distance metric. 然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。为了解决这个问题，我们观察到，2.3.2中描述的通过无词典转录预测的标签序列通常在编辑距离度量下接近于实际结果。
  • In particular, in the transcription layer, error differentials are back-propagated with the forward-backward algorithm, as described in [15]. 特别地，在转录层中，如[15]所述，误差微分使用前向-后向算法进行反向传播。
  • We implement the network within the Torch7 [10] framework, with custom implementations for the LSTM units (in Torch7/CUDA), the transcription layer (in C++) and the BK-tree data structure (in C++). 我们在Torch7[10]框架内实现了网络,使用定制实现的LSTM单元(Torch7/CUDA),转录层(C++)和BK树数据结构(C++)。
  • Larger $\delta$ results in more candidates, thus more accurate lexicon-based transcription. 更大的$\delta$产生更多的候选目标，从而使基于词典的转录更准确。
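The lexicon-free transcription mode quoted above reduces to a few lines of code. Below is a minimal sketch (not the paper's Torch7 implementation), assuming the per-frame predictions are a T×|L'| numpy array with the 'blank' class at index 0 and `charset` listing the non-blank labels in order:

```python
import numpy as np

def greedy_transcribe(probs, charset, blank=0):
    """Lexicon-free transcription: take the most probable label at each
    frame, then apply the mapping B (collapse repeats, drop blanks)."""
    best_path = probs.argmax(axis=1)        # most probable label per frame
    labels, prev = [], blank
    for k in best_path:
        if k != blank and k != prev:        # collapse repeats, drop blanks
            labels.append(k)
        prev = k
    return "".join(charset[k - 1] for k in labels)  # index 0 is 'blank'
```

On the paper's example quoted later in this list, a best path of “–hh-e-l-ll-oo–” collapses to “hello”.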
5 arbitrary [ˈɑ:bɪtrəri]
  • (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (2)它自然地处理任意长度的序列,不涉及字符分割或水平尺度归一化。
  • Thirdly, RNN is able to operate on sequences of arbitrary lengths, traversing from starts to ends. 第三,RNN能够从头到尾对任意长度的序列进行操作。
6 predefined [pri:dɪ'faɪnd]
  • (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. （3）它不限于任何预定义的词典，并且在无词典和基于词典的场景文本识别任务中都取得了显著的性能。
7 lexicon [ˈleksɪkən]
  • (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. （3）它不限于任何预定义的词典，并且在无词典和基于词典的场景文本识别任务中都取得了显著的性能。
  • A lexicon is a set of label sequences that prediction is constrained to, e.g. a spell checking dictionary. 词典是预测所受约束的一组标签序列，例如一个拼写检查字典。
  • In lexicon-free mode, predictions are made without any lexicon. 在无词典模式中,预测时没有任何词典。
  • In lexicon-based mode, each test sample is associated with a lexicon ${\cal D}$. 在基于词典的模式中，每个测试样本与一个词典${\cal D}$相关联。
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Equation 1 for all sequences in the lexicon and choose the one with the highest probability. 基本上，通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列，即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。
  • However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Equation 1 for all sequences in the lexicon and choose the one with the highest probability. To solve this problem, we observe that the label sequences predicted via lexicon-free transcription, described in 2.3.2, are often close to the ground-truth under the edit distance metric. 然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。为了解决这个问题，我们观察到，2.3.2中描述的通过无词典转录预测的标签序列通常在编辑距离度量下接近于实际结果。
  • The search time complexity of BK-tree is $O(\log|{\cal D}|)$, where $|{\cal D}|$ is the lexicon size. BK树的搜索时间复杂度为$O(\log|{\cal D}|)$,其中$|{\cal D}|$是词典大小。
  • Therefore this scheme readily extends to very large lexicons. 因此,这个方案很容易扩展到非常大的词典。
  • In our approach, a BK-tree is constructed offline for a lexicon. 在我们的方法中，预先为词典离线构建一个BK树。
  • Each test image is associated with a 50-words lexicon which is defined by Wang et al. [34]. 每张测试图像与由Wang等人[34]定义的50词的词典相关联。
  • A full lexicon is built by combining all the per-image lexicons. 通过组合所有图像各自的词典来构建完整的词典。
  • In addition, we use a 50k words lexicon consisting of the words in the Hunspell spell-checking dictionary [1]. 此外,我们使用由Hunspell拼写检查字典[1]中的单词组成的5万个词的词典。
  • Each image has been associated with a 50-words lexicon and a 1k-words lexicon. 每张图像关联一个50词的词典和一个1000词的词典。
  • Each word image has a 50 words lexicon defined by Wang et al. [34]. 每张单词图像都有一个由Wang等人[34]定义的50个词的词典。
  • The average testing time is 0.16s/sample, as measured on IC03 without a lexicon. 平均测试时间为0.16秒/样本（在IC03上、无词典条件下测得）。
  • The approximate lexicon search is applied to the 50k lexicon of IC03, with the parameter δ set to 3. 近似词典搜索应用于IC03的50k词典,参数δ设置为3。
  • In the second row, “50”, “1k”, “50k” and “Full” denote the lexicon used, and “None” denotes recognition without a lexicon. 在第二行，“50”、“1k”、“50k”和“Full”表示所使用的词典，“None”表示无词典识别。
  • In the constrained lexicon cases, our method consistently outperforms most state-of-the-art approaches, and on average beats the best text reader proposed in [22]. 在有约束词典的情况下，我们的方法始终优于大多数最新的方法，并且平均超过了[22]中提出的最佳文本阅读器。
  • Specifically, we obtain superior performance on IIIT5k and SVT compared to [22], and only achieve lower performance on IC03 with the “Full” lexicon. 具体来说，与[22]相比，我们在IIIT5k和SVT上获得了更优的性能，仅在使用“Full”词典的IC03上性能较低。
  • In the unconstrained lexicon cases, our method achieves the best performance on SVT, yet is still behind some approaches [8, 22] on IC03 and IC13. 在无约束词典的情况下，我们的方法在SVT上取得了最佳性能，但在IC03和IC13上仍落后于一些方法[8,22]。
  • Note that the blanks in the “none” columns of Table 2 denote that such approaches are unable to be applied to recognition without lexicon or did not report the recognition accuracies in the unconstrained cases. 注意，表2的“none”列中的空白表示这种方法不能应用于无词典识别，或者没有报告无约束情况下的识别精度。
  • The best performance in the unconstrained lexicon cases is reported by [22], benefiting from its large dictionary; however, it is not a model strictly unconstrained to a lexicon, as mentioned before. 在无约束词典的情况下，最佳性能由[22]报告，这得益于其大字典；然而，如前所述，它并不是严格不受词典约束的模型。
  • In this sense, our results in the unconstrained lexicon case are still promising. 从这个意义上说，我们在无约束词典情况下的结果仍然是有前途的。
  • Red bars: lexicon search time per sample. 红条:每个样本的词典搜索时间。
  • Tested on the IC03 dataset with the 50k lexicon. 在IC03数据集上使用50k词典进行的测试。
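The lexicon-based mode these sentences describe can be written down directly. A brute-force sketch under stated assumptions: `seq_prob` is a hypothetical stand-in for evaluating Eq.1 (e.g. via the forward-backward sketch further below), and the linear scan stands in for the BK-tree search the paper uses:

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def lexicon_transcribe(probs, lexicon, charset, seq_prob, delta=3):
    """Restrict the search to candidates within edit distance delta of the
    lexicon-free prediction, then pick the most probable candidate."""
    l_free = greedy_transcribe(probs, charset)   # sketch under 'transcription'
    cands = [w for w in lexicon if edit_distance(w, l_free) <= delta]
    return max(cands, key=lambda w: seq_prob(probs, w)) if cands else l_free
```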
8 lexicon-based [!≈ ˈleksɪkən beɪst]
  • (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. （3）它不限于任何预定义的词典，并且在无词典和基于词典的场景文本识别任务中都取得了显著的性能。
  • In practice, there exists two modes of transcription, namely the lexicon-free and lexicon-based transcriptions. 在实践中,存在两种转录模式,即无词典转录和基于词典的转录。
  • In lexicon-based mode, predictions are made by choosing the label sequence that has the highest probability. 在基于词典的模式中,通过选择具有最高概率的标签序列进行预测。
  • 2.3.3 Lexicon-based transcription 2.3.3 基于词典的转录
  • In lexicon-based mode, each test sample is associated with a lexicon ${\cal D}$. 在基于词典的模式中，每个测试样本与一个词典${\cal D}$相关联。
  • Larger $\delta$ results in more candidates, thus more accurate lexicon-based transcription. 更大的$\delta$产生更多的候选目标，从而使基于词典的转录更准确。
9 real-world [!≈ ˈri:əl wɜ:ld]
  • (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. (4)它产生了一个有效而小得多的模型,这对于现实世界的应用场景更为实用。
  • Our network is trained on the synthetic data once, and tested on all other real-world test datasets without any fine-tuning on their training data. 我们的网络在合成数据上进行了一次训练,并在所有其它现实世界的测试数据集上进行了测试,而没有在其训练数据上进行任何微调。
  • It contains 200 samples, some of which are shown in Fig. 5.b; 3) “Real-World”, which contains 200 images of score fragments taken from music books with a phone camera. 它包含200个样本，其中一些如图5.b所示；3）“现实世界”（Real-World），其中包含用手机相机从音乐书籍中拍摄的200张乐谱片段图像。
  • (c) Real-world score images taken with a mobile phone camera. (c)用手机相机拍摄的现实世界的乐谱图像。
  • The CRNN outperforms the two commercial systems by a large margin. The Capella Scan and PhotoScore systems perform reasonably well on the Clean dataset, but their performances drop significantly on synthesized and real-world data. CRNN大幅度优于这两个商业系统。Capella Scan和PhotoScore系统在干净的数据集上表现相当不错，但是它们的性能在合成和现实世界数据上显著下降。
  • The main reason is that they rely on robust binarization to detect staff lines and notes, but the binarization step often fails on synthesized and real-world data due to bad lighting condition, noise corruption and cluttered background. 主要原因是它们依赖于鲁棒的二值化来检测五线谱和音符，但是由于光线不良、噪声破坏和杂乱的背景，二值化步骤经常会在合成数据和现实数据上失败。
  • To further speed up CRNN and make it more practical in real-world applications is another direction that is worthy of exploration in the future. 进一步加快CRNN,使其在现实应用中更加实用,是未来值得探索的另一个方向。
10 scenario [səˈnɑ:riəʊ]
  • (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. (4)它产生了一个有效而小得多的模型,这对于现实世界的应用场景更为实用。
11 ICDAR [!≈ aɪ si: di: eɪ ɑ:(r)]
  • The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. 在包括IIIT-5K,Street View Text和ICDAR数据集在内的标准基准数据集上的实验证明了提出的算法比现有技术的更有优势。
  • Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT). 有四个流行的基准数据集用于场景文本识别的性能评估,即ICDAR 2003(IC03),ICDAR 2013(IC13),IIIT 5k-word(IIIT5k)和Street View Text (SVT)。
12 generality [ˌdʒenəˈræləti]
  • Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it. 此外，提出的算法在基于图像的乐谱识别任务中表现良好，这显然证实了它的泛化性。
  • To further demonstrate the generality of CRNN, we verify the proposed algorithm on a music score recognition task in Sec. 3.4. 为了进一步证明CRNN的泛化性,在3.4小节我们在乐谱识别任务上验证了提出的算法。
  • The results have shown the generality of CRNN, in that it can be readily applied to other image-based sequence recognition problems, requiring minimal domain knowledge. 结果显示了CRNN的泛化性,因为它可以很容易地应用于其它的基于图像的序列识别问题,需要极少的领域知识。
  • In addition, CRNN significantly outperforms other competitors on a benchmark for Optical Music Recognition (OMR), which verifies the generality of CRNN. 此外,CRNN在光学音乐识别(OMR)的基准数据集上显著优于其它的竞争者,这验证了CRNN的泛化性。
13 revival [rɪˈvaɪvl]
  • Recently, the community has seen a strong revival of neural networks, which is mainly stimulated by the great success of deep neural network models, specifically Deep Convolutional Neural Networks (DCNN), in various vision tasks. 最近,社区已经看到神经网络的强大复兴,这主要受到深度神经网络模型,特别是深度卷积神经网络(DCNN)在各种视觉任务中的巨大成功的推动。
14 DCNN [!≈ di: si: en en]
  • Recently, the community has seen a strong revival of neural networks, which is mainly stimulated by the great success of deep neural network models, specifically Deep Convolutional Neural Networks (DCNN), in various vision tasks. 最近,社区已经看到神经网络的强大复兴,这主要受到深度神经网络模型,特别是深度卷积神经网络(DCNN)在各种视觉任务中的巨大成功的推动。
  • Consequently, the most popular deep models like DCNN [25, 26] cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions, and thus are incapable of producing a variable-length label sequence. 因此,最流行的深度模型像DCNN[25,26]不能直接应用于序列预测,因为DCNN模型通常对具有固定维度的输入和输出进行操作,因此不能产生可变长度的标签序列。
  • For example, the algorithms in [35, 8] firstly detect individual characters and then recognize these detected characters with DCNN models, which are trained using labeled character images. 例如,[35,8]中的算法首先检测单个字符,然后用DCNN模型识别这些检测到的字符,并使用标注的字符图像进行训练。
  • In summary, current systems based on DCNN cannot be directly used for image-based sequence recognition. 总之，目前基于DCNN的系统不能直接用于基于图像的序列识别。
  • The proposed neural network model is named as Convolutional Recurrent Neural Network (CRNN), since it is a combination of DCNN and RNN. 所提出的神经网络模型被称为卷积循环神经网络(CRNN),因为它是DCNN和RNN的组合。
  • 2) It has the same property of DCNN on learning informative representations directly from image data, requiring neither hand-crafted features nor preprocessing steps, including binarization/segmentation, component localization, etc.; 2）它具有与DCNN相同的性质，能直接从图像数据学习有信息的表示，既不需要手工特征也不需要预处理步骤，包括二值化/分割、组件定位等；
  • 5) It achieves better or highly competitive performance on scene texts (word recognition) than the prior arts [23, 8]; 6) It contains much fewer parameters than a standard DCNN model, consuming less storage space. 5）与现有技术[23,8]相比，它在场景文本（单词识别）上获得更好或极具竞争力的性能；6）它比标准DCNN模型包含的参数少得多，占用更少的存储空间。
15 drastically ['drɑ:stɪklɪ]
  • Another unique property of sequence-like objects is that their lengths may vary drastically. 类序列对象的另一个独特之处在于它们的长度可能会有很大变化。
16 congratulations [kənˌgrætjʊ'leɪʃənz]
  • For instance, English words can either consist of 2 characters such as “OK” or 15 characters such as “congratulations”. 例如,英文单词可以由2个字符组成,如“OK”,或由15个字符组成,如“congratulations”。
17 incapable [ɪnˈkeɪpəbl]
  • Consequently, the most popular deep models like DCNN [25, 26] cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions, and thus are incapable of producing a variable-length label sequence. 因此,最流行的深度模型像DCNN[25,26]不能直接应用于序列预测,因为DCNN模型通常对具有固定维度的输入和输出进行操作,因此不能产生可变长度的标签序列。
18 variable-length ['veərɪəbll'eŋθ]
  • Consequently, the most popular deep models like DCNN [25, 26] cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions, and thus are incapable of producing a variable-length label sequence. 因此,最流行的深度模型像DCNN[25,26]不能直接应用于序列预测,因为DCNN模型通常对具有固定维度的输入和输出进行操作,因此不能产生可变长度的标签序列。
19 e.g. [ˌi: ˈdʒi:]
  • Some attempts have been made to address this problem for a specific sequence-like object (e.g. scene text). 已经针对特定的类似序列的对象(例如场景文本)进行了一些尝试来解决该问题。
  • Besides, some ambiguous characters are easier to distinguish when observing their contexts, e.g. it is easier to recognize “il” by contrasting the character heights than by recognizing each of them separately. 此外,一些模糊的字符在观察其上下文时更容易区分,例如,通过对比字符高度更容易识别“il”而不是分别识别它们中的每一个。
  • A lexicon is a set of label sequences that prediction is constrained to, e.g. a spell checking dictionary. 词典是预测所受约束的一组标签序列，例如一个拼写检查字典。
  • Here, each $y_t \in\Re^{|{\cal L}'|}$ is a probability distribution over the set ${\cal L}' = {\cal L}\cup\{-\}$, where ${\cal L}$ contains all labels in the task (e.g. all English characters), as well as a 'blank' label denoted by -. 这里，每个$y_t \in\Re^{|{\cal L}'|}$是在集合${\cal L}' = {\cal L}\cup\{-\}$上的概率分布，其中${\cal L}$包含了任务中的所有标签（例如，所有英文字符），以及由-表示的“空白”标签。
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Equation 1 for all sequences in the lexicon and choose the one with the highest probability. 基本上，通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列，即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。
  • Unlike [22], CRNN is not limited to recognize a word in a known dictionary, and able to handle random strings (e.g. telephone numbers), sentences or other scripts like Chinese words. 与[22]不同,CRNN不限于识别已知字典中的单词,并且能够处理随机字符串(例如电话号码),句子或其他诸如中文单词的脚本。
  • Consequently, some notes can be recognized by comparing them with the nearby notes, e.g. contrasting their vertical positions. 因此,通过将一些音符与附近的音符进行比较可以识别它们,例如对比他们的垂直位置。
  • It directly runs on coarse level labels (e.g. words), requiring no detailed annotations for each individual element (e.g. characters) in the training phase. 它直接在粗粒度的标签(例如单词)上运行,在训练阶段不需要详细标注每一个单独的元素(例如字符)。
20 generalized [ˈdʒenrəlaɪzd]
  • This results in a large trained model with a huge number of classes, which is difficult to generalize to other types of sequence-like objects, such as Chinese texts, musical scores, etc., because the number of basic combinations of such kinds of sequences can be greater than 1 million. 这会得到一个具有大量类别的大型训练模型，很难将其泛化到其它类型的类序列对象，如中文文本、乐谱等，因为这类序列的基本组合数目可能大于100万。
21 recurrent [rɪˈkʌrənt]
  • Recurrent neural networks (RNN) models, another important branch of the deep neural networks family, were mainly designed for handling sequences. 循环神经网络(RNN)模型是深度神经网络家族中的另一个重要分支,主要是设计来处理序列。
  • The proposed neural network model is named as Convolutional Recurrent Neural Network (CRNN), since it is a combination of DCNN and RNN. 所提出的神经网络模型被称为卷积循环神经网络(CRNN),因为它是DCNN和RNN的组合。
  • The network architecture of CRNN, as shown in Fig. 1, consists of three components, including the convolutional layers, the recurrent layers, and a transcription layer, from bottom to top. 如图1所示,CRNN的网络架构由三部分组成,包括卷积层,循环层和转录层,从底向上。
  • 2) recurrent layers, which predict a label distribution for each frame; 3) transcription layer, which translates the per-frame predictions into the final label sequence. 2) 循环层,预测每一帧的标签分布;3) 转录层,将每一帧的预测变为最终的标签序列。
  • On top of the convolutional network, a recurrent network is built for making prediction for each frame of the feature sequence, outputted by the convolutional layers. 在卷积网络之上,构建了一个循环网络,用于对卷积层输出的特征序列的每一帧进行预测。
  • The transcription layer at the top of CRNN is adopted to translate the per-frame predictions by the recurrent layers into a label sequence. 采用CRNN顶部的转录层将循环层的每帧预测转化为标签序列。
  • Then a sequence of feature vectors is extracted from the feature maps produced by the component of convolutional layers, which is the input for the recurrent layers. 然后从卷积层组件产生的特征图中提取特征向量序列,这些特征向量序列作为循环层的输入。
  • A deep bidirectional Recurrent Neural Network is built on top of the convolutional layers, as the recurrent layers. 在卷积层之上构建了一个深度双向循环神经网络，作为循环层。
  • The recurrent layers predict a label distribution $y_t$ for each frame $x_t$ in the feature sequence $x = x_1,…,x_T$. 循环层预测特征序列$x = x_1,…,x_T$中每一帧$x_t$的标签分布$y_t$。
  • The advantages of the recurrent layers are three-fold. 循环层具有三方面的优点。
  • Secondly, RNN can back-propagate error differentials to its input, i.e. the convolutional layer, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network. 其次，RNN可以将误差微分反向传播到它的输入，即卷积层，从而允许我们在统一的网络中联合训练循环层和卷积层。
  • In recurrent layers, error differentials are propagated in the opposite directions of the arrows shown in Fig. 3.b, i.e. Back-Propagation Through Time (BPTT). 在循环层中，误差微分沿图3.b所示箭头的相反方向传播，即随时间反向传播（BPTT）。
  • At the bottom of the recurrent layers, the sequence of propagated differentials is concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers. 在循环层的底部，传播的微分序列被连接成特征图，这反转了将特征图转换为特征序列的操作，然后反馈给卷积层。
  • In practice, we create a custom network layer, called “Map-to-Sequence”, as the bridge between convolutional layers and recurrent layers. 实际上,我们创建一个称为“Map-to-Sequence”的自定义网络层,作为卷积层和循环层之间的桥梁。
  • where $\mathbf{y}_{i}$ is the sequence produced by the recurrent and convolutional layers from $I_{i}$. 其中$\mathbf{y}_{i}$是循环层和卷积层从$I_{i}$生成的序列。
  • In the recurrent layers, the Back-Propagation Through Time (BPTT) is applied to calculate the error differentials. 在循环层中，应用随时间反向传播（BPTT）来计算误差微分。
  • The network not only has deep convolutional layers, but also has recurrent layers. 网络不仅有深度卷积层,而且还有循环层。
  • Besides, recurrent layers in CRNN can utilize contextual information in the score. 此外,CRNN中的循环层可以利用乐谱中的上下文信息。
  • In this paper, we have presented a novel neural network architecture, called Convolutional Recurrent Neural Network (CRNN), which integrates the advantages of both Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). 在本文中,我们提出了一种新颖的神经网络架构,称为卷积循环神经网络(CRNN),其集成了卷积神经网络(CNN)和循环神经网络(RNN)的优点。
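As a concrete picture of the recurrent component these sentences describe, here is a minimal sketch of a deep bidirectional LSTM mapping each frame of the feature sequence to a label distribution. The paper implements its network in Torch7; this sketch uses PyTorch purely for illustration, and all sizes are assumptions:

```python
import torch.nn as nn

class RecurrentLayers(nn.Module):
    """Deep bidirectional LSTM predicting a label distribution y_t for each
    frame x_t of the feature sequence (sizes here are illustrative)."""
    def __init__(self, feat_dim=512, hidden=256, num_labels=37):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_labels)  # 2x: both directions

    def forward(self, x):                   # x: (T, batch, feat_dim)
        h, _ = self.rnn(x)                  # h: (T, batch, 2*hidden)
        return self.fc(h).log_softmax(-1)   # per-frame label distribution
```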
22 geometrical [ˌdʒi:ə'metrɪkl]
  • For example, Graves et al. [16] extract a set of geometrical or image features from handwritten texts, while Su and Lu [33] convert word images into sequential HOG features. 例如，Graves等人[16]从手写文本中提取一组几何或图像特征，而Su和Lu[33]将单词图像转换为序列的HOG特征。
23 sequential [sɪˈkwenʃl]
  • For example, Graves et al. [16] extract a set of geometrical or image features from handwritten texts, while Su and Lu [33] convert word images into sequential HOG features. 例如，Graves等人[16]从手写文本中提取一组几何或图像特征，而Su和Lu[33]将单词图像转换为序列的HOG特征。
  • Such component is used to extract a sequential feature representation from an input image. 这样的组件用于从输入图像中提取序列特征表示。
  • In CRNN, we convey deep features into sequential representations in order to be invariant to the length variation of sequence-like objects. 在CRNN中,我们将深度特征传递到序列表示中,以便对类序列对象的长度变化保持不变。
24 HOG [hɒg]
  • For example, Graves et al. [16] extract a set of geometrical or image features from handwritten texts, while Su and Lu [33] convert word images into sequential HOG features. 例如，Graves等人[16]从手写文本中提取一组几何或图像特征，而Su和Lu[33]将单词图像转换为序列的HOG特征。
25 insightful [ˈɪnsaɪtfʊl]
  • Several conventional scene text recognition methods that are not based on neural networks also brought insightful ideas and novel representations into this field. 一些不基于神经网络的传统场景文本识别方法也为这一领域带来了有见地的想法和新颖的表示。
26 Almazan
  • For example, Almazan et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, and word recognition is converted into a retrieval problem. 例如,Almazan等人[5]和Rodriguez-Serrano等人[30]提出将单词图像和文本字符串嵌入到公共向量子空间中,并将词识别转换为检索问题。
27 Rodriguez-Serrano
  • For example, Almazan et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, and word recognition is converted into a retrieval problem. 例如,Almazan等人[5]和Rodriguez-Serrano等人[30]提出将单词图像和文本字符串嵌入到公共向量子空间中,并将词识别转换为检索问题。
28 vectorial [vek'tɒrɪəl]
  • For example, Almazan et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, and word recognition is converted into a retrieval problem. 例如,Almazan等人[5]和Rodriguez-Serrano等人[30]提出将单词图像和文本字符串嵌入到公共向量子空间中,并将词识别转换为检索问题。
29 retrieval [rɪˈtri:vl]
  • For example, Almazan et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, and word recognition is converted into a retrieval problem. 例如,Almazan等人[5]和Rodriguez-Serrano等人[30]提出将单词图像和文本字符串嵌入到公共向量子空间中,并将词识别转换为检索问题。
30 Gordo
  • Yao et al. [36] and Gordo et al. [14] used mid-level features for scene text recognition. Yao等人[36]和Gordo等人[14]使用中层特征进行场景文本识别。
31 annotation [ˌænə'teɪʃn]
  • For sequence-like objects, CRNN possesses several distinctive advantages over conventional neural network models: 1) It can be directly learned from sequence labels (for instance, words), requiring no detailed annotations (for instance, characters); 对于类序列对象,CRNN与传统神经网络模型相比具有一些独特的优点:1)可以直接从序列标签(例如单词)学习,不需要详细的标注(例如字符);
  • Our method uses only synthetic text with word level labels as the training data, very different from PhotoOCR [8], which used 7.9 million real word images with character-level annotations for training. 我们的方法只使用具有单词级标签的合成文本作为训练数据，与PhotoOCR[8]非常不同，后者使用790万个具有字符级标注的真实单词图像进行训练。
  • CharGT-Free: This column is to indicate whether the character-level annotations are essential for training the model. CharGT-Free:这一列用来表明字符级标注对于训练模型是否是必要的。
  • As the input and output labels of CRNN can be a sequence, character-level annotations are not necessary. 由于CRNN的输入和输出标签是序列,因此字符级标注是不必要的。
  • It directly runs on coarse level labels (e.g. words), requiring no detailed annotations for each individual element (e.g. characters) in the training phase. 它直接在粗粒度的标签(例如单词)上运行,在训练阶段不需要详细标注每一个单独的元素(例如字符)。
32 informative [ɪnˈfɔ:mətɪv]
  • 2) It has the same property of DCNN on learning informative representations directly from image data, requiring neither hand-crafted features nor preprocessing steps, including binarization/segmentation, component localization, etc.; 2）它具有与DCNN相同的性质，能直接从图像数据学习有信息的表示，既不需要手工特征也不需要预处理步骤，包括二值化/分割、组件定位等；
33 hand-craft ['hæn(d)krɑːft]
  • 2) It has the same property of DCNN on learning informative representations directly from image data, requiring neither hand-crafted features nor preprocessing steps, including binarization/segmentation, component localization, etc.; 2）它具有与DCNN相同的性质，能直接从图像数据学习有信息的表示，既不需要手工特征也不需要预处理步骤，包括二值化/分割、组件定位等；
34 binarization
  • 2) It has the same property of DCNN on learning informative representations directly from image data, requiring neither hand-crafted features nor preprocessing steps, including binarization/segmentation, component localization, etc.; 2）它具有与DCNN相同的性质，能直接从图像数据学习有信息的表示，既不需要手工特征也不需要预处理步骤，包括二值化/分割、组件定位等；
  • The main reason is that they rely on robust binarization to detect staff lines and notes, but the binarization step often fails on synthesized and real-world data due to bad lighting condition, noise corruption and cluttered background. 主要原因是它们依赖于鲁棒的二值化来检测五线谱和音符，但是由于光线不良、噪声破坏和杂乱的背景，二值化步骤经常会在合成数据和现实数据上失败。
35 unconstrained [ˌʌnkən'streɪnd]
  • 3) It has the same property of RNN, being able to produce a sequence of labels; 4) It is unconstrained to the lengths of sequence-like objects, requiring only height normalization in both training and testing phases; 3)具有与RNN相同的性质,能够产生一系列标签;4)对类序列对象的长度无约束,只需要在训练阶段和测试阶段对高度进行归一化;
  • In the unconstrained lexicon cases, our method achieves the best performance on SVT, yet is still behind some approaches [8, 22] on IC03 and IC13. 在无约束词典的情况下，我们的方法在SVT上取得了最佳性能，但在IC03和IC13上仍落后于一些方法[8,22]。
  • Note that the blanks in the “none” columns of Table 2 denote that such approaches are unable to be applied to recognition without lexicon or did not report the recognition accuracies in the unconstrained cases. 注意，表2的“none”列中的空白表示这种方法不能应用于无词典识别，或者没有报告无约束情况下的识别精度。
  • The best performance in the unconstrained lexicon cases is reported by [22], benefiting from its large dictionary; however, it is not a model strictly unconstrained to a lexicon, as mentioned before. 在无约束词典的情况下，最佳性能由[22]报告，这得益于其大字典；然而，如前所述，它并不是严格不受词典约束的模型。
  • In this sense, our results in the unconstrained lexicon case are still promising. 从这个意义上说，我们在无约束词典情况下的结果仍然是有前途的。
  • For further understanding the advantages of the proposed algorithm over other text recognition approaches, we provide a comprehensive comparison on several properties named E2E Train, Conv Ftrs, CharGT-Free, Unconstrained, and Model Size, as summarized in Table 3. 为了进一步了解与其它文本识别方法相比,所提出算法的优点,我们提供了在一些特性上的综合比较,这些特性名称为E2E Train,Conv Ftrs,CharGT-Free,Unconstrained和Model Size,如表3所示。
  • Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions). 比较的属性包括:1)端到端训练(E2E Train);2)从图像中直接学习卷积特征而不是使用手动设计的特征(Conv Ftrs);3)训练期间不需要字符的实际边界框(CharGT-Free);4)不受限于预定义字典(Unconstrained);5)模型大小(如果使用端到端模型),通过模型参数数量来衡量(Model Size, M表示百万)。
  • Unconstrained: This column is to indicate whether the trained model is constrained to a specific dictionary, unable to handle out-of-dictionary words or random sequences. Unconstrained：这一列用来表明训练模型是否受限于特定字典，即是否无法处理字典之外的单词或随机序列。
36 outputted ['aʊt.pʊt]
  • On top of the convolutional network, a recurrent network is built for making prediction for each frame of the feature sequence, outputted by the convolutional layers. 在卷积网络之上,构建了一个循环网络,用于对卷积层输出的特征序列的每一帧进行预测。
37 eg
  • Though CRNN is composed of different kinds of network architectures (e.g. CNN and RNN), it can be jointly trained with one loss function. 虽然CRNN由不同类型的网络架构组成（例如CNN和RNN），但可以用一个损失函数进行联合训练。
38 jointly [dʒɔɪntlɪ]
  • Though CRNN is composed of different kinds of network architectures (eg. CNN and RNN), it can be jointly trained with one loss function. 虽然CRNN由不同类型的网络架构(如CNN和RNN)组成,但可以通过一个损失函数进行联合训练。
  • Secondly, RNN can back-propagate error differentials to its input, i.e. the convolutional layer, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network. 其次，RNN可以将误差微分反向传播到它的输入，即卷积层，从而允许我们在统一的网络中联合训练循环层和卷积层。
39 concatenation [kənˌkætəˈneɪʃn]
  • This means the i-th feature vector is the concatenation of the i-th columns of all the maps. 这意味着第i个特征向量是所有特征图第i列的连接。
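This column-wise concatenation (the “Map-to-Sequence” bridge mentioned elsewhere in this list) is easy to make concrete; a minimal numpy sketch, assuming feature maps of shape (C, H, W):

```python
import numpy as np

def map_to_sequence(feature_maps):
    """Turn CNN feature maps (C, H, W) into W feature vectors: the i-th
    vector is the concatenation of the i-th columns of all the maps."""
    C, H, W = feature_maps.shape
    return feature_maps.transpose(2, 0, 1).reshape(W, C * H)
```

For instance, maps of shape (512, 1, 26) yield a sequence of 26 vectors of length 512, in left-to-right column order.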
40 invariant [ɪnˈveəriənt]
  • As the layers of convolution, max-pooling, and element-wise activation function operate on local regions, they are translation invariant. 由于卷积层,最大池化层和元素激活函数在局部区域上执行,因此它们是平移不变的。
  • In CRNN, we convey deep features into sequential representations in order to be invariant to the length variation of sequence-like objects. 在CRNN中,我们将深度特征传递到序列表示中,以便对类序列对象的长度变化保持不变。
41 receptive [rɪˈseptɪv]
  • Therefore, each column of the feature maps corresponds to a rectangle region of the original image (termed the receptive field), and such rectangle regions are in the same order to their corresponding columns on the feature maps from left to right. 因此,特征图的每列对应于原始图像的一个矩形区域(称为感受野),并且这些矩形区域与特征图上从左到右的相应列具有相同的顺序。
  • As illustrated in Fig. 2, each vector in the feature sequence is associated with a receptive field, and can be considered as the image descriptor for that region. 如图2所示,特征序列中的每个向量关联一个感受野,并且可以被认为是该区域的图像描述符。
  • The receptive field. 感受野。
  • Each vector in the extracted feature sequence is associated with a receptive field on the input image, and can be considered as the feature vector of that field. 提取的特征序列中的每一个向量关联输入图像的一个感受野,可认为是该区域的特征向量。
  • On top of that, the rectangular pooling windows yield rectangular receptive fields (illustrated in Fig. 2), which are beneficial for recognizing some characters that have narrow shapes, such as ’i’ and ’l’. 除此之外，矩形池化窗口产生矩形感受野（如图2所示），这有助于识别一些具有窄形状的字符，例如’i’和’l’。
42 descriptor [dɪˈskrɪptə(r)]
  • As illustrated in Fig. 2, each vector in the feature sequence is associated with a receptive field, and can be considered as the image descriptor for that region. 如图2所示,特征序列中的每个向量关联一个感受野,并且可以被认为是该区域的图像描述符。
43 holistic [həʊˈlɪstɪk]
  • However, these approaches usually extract holistic representation of the whole image by CNN, then the local deep features are collected for recognizing each component of a sequence-like object. 然而,这些方法通常通过CNN提取整个图像的整体表示,然后收集局部深度特征来识别类序列对象的每个分量。
44 capability [ˌkeɪpəˈbɪləti]
  • Firstly, RNN has a strong capability of capturing contextual information within a sequence. 首先,RNN具有很强的捕获序列内上下文信息的能力。
  • But it provides a new scheme for OMR, and has shown promising capabilities in pitch recognition. 但它为OMR提供了一个新的方案,并且在音高识别方面表现出有前途的能力。
45 contextual [kənˈtekstʃuəl]
  • Firstly, RNN has a strong capability of capturing contextual information within a sequence. 首先,RNN具有很强的捕获序列内上下文信息的能力。
  • Using contextual cues for image-based sequence recognition is more stable and helpful than treating each symbol independently. 对于基于图像的序列识别使用上下文提示比独立处理每个符号更稳定且更有帮助。
  • Besides, recurrent layers in CRNN can utilize contextual information in the score. 此外,CRNN中的循环层可以利用乐谱中的上下文信息。
46 cue [kju:]
  • Using contextual cues for image-based sequence recognition is more stable and helpful than treating each symbol independently. 对于基于图像的序列识别使用上下文提示比独立处理每个符号更稳定且更有帮助。
47 il
  • Besides, some ambiguous characters are easier to distinguish when observing their contexts, e.g. it is easier to recognize “il” by contrasting the character heights than by recognizing each of them separately. 此外,一些模糊的字符在观察其上下文时更容易区分,例如,通过对比字符高度更容易识别“il”而不是分别识别它们中的每一个。
48 back-propagate [!≈ bæk ˈprɒpəgeɪt]
  • Secondly, RNN can back-propagate error differentials to its input, i.e. the convolutional layer, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network. 其次，RNN可以将误差微分反向传播到它的输入，即卷积层，从而允许我们在统一的网络中联合训练循环层和卷积层。
49 differential [ˌdɪfəˈrenʃl]
  • Secondly, RNN can back-propagate error differentials to its input, i.e. the convolutional layer, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network. 其次，RNN可以将误差微分反向传播到它的输入，即卷积层，从而允许我们在统一的网络中联合训练循环层和卷积层。
  • In recurrent layers, error differentials are propagated in the opposite directions of the arrows shown in Fig. 3.b, i.e. Back-Propagation Through Time (BPTT). 在循环层中，误差微分沿图3.b所示箭头的相反方向传播，即随时间反向传播（BPTT）。
  • At the bottom of the recurrent layers, the sequence of propagated differentials is concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers. 在循环层的底部，传播的微分序列被连接成特征图，这反转了将特征图转换为特征序列的操作，然后反馈给卷积层。
  • In particular, in the transcription layer, error differentials are back-propagated with the forward-backward algorithm, as described in [15]. 特别地，在转录层中，如[15]所述，误差微分使用前向-后向算法进行反向传播。
  • In the recurrent layers, the Back-Propagation Through Time (BPTT) is applied to calculate the error differentials. 在循环层中，应用随时间反向传播（BPTT）来计算误差微分。
50 i.e. [ˌaɪ ˈi:]
  • Secondly, RNN can back-propagate error differentials to its input, i.e. the convolutional layer, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network. 其次，RNN可以将误差微分反向传播到它的输入，即卷积层，从而允许我们在统一的网络中联合训练循环层和卷积层。
  • In recurrent layers, error differentials are propagated in the opposite directions of the arrows shown in Fig. 3.b, i.e. Back-Propagation Through Time (BPTT). 在循环层中，误差微分沿图3.b所示箭头的相反方向传播，即随时间反向传播（BPTT）。
  • The sequence $\mathbf{l}^{*}$ is approximately found by $\mathbf{l}^{*}\approx{\cal B}(\arg\max_{\boldsymbol{\pi}}p(\boldsymbol{\pi}|\mathbf{y}))$, i.e. taking the most probable label $\pi_{t}$ at each timestamp $t$, and mapping the resulting sequence onto $\mathbf{l}^{*}$. 序列$\mathbf{l}^{*}$由$\mathbf{l}^{*}\approx{\cal B}(\arg\max_{\boldsymbol{\pi}}p(\boldsymbol{\pi}|\mathbf{y}))$近似得到，即在每个时间戳$t$取概率最大的标签$\pi_{t}$，并将得到的序列映射到$\mathbf{l}^{*}$。
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Equation 1 for all sequences in the lexicon and choose the one with the highest probability. 基本上，通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列，即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。
  • Two measures are used for evaluating the recognition performance: 1) fragment accuracy, i.e. the percentage of score fragments correctly recognized; 2) average edit distance, i.e. the average edit distance between predicted pitch sequences and the ground truths. 使用两种方法来评估识别性能:1)片段准确度,即正确识别的乐谱片段的百分比;2)平均编辑距离,即预测音调序列与真实值之间的平均编辑距离。
51 traverse [trəˈvɜ:s]
  • Thirdly, RNN is able to operate on sequences of arbitrary lengths, traversing from starts to ends. 第三,RNN能够从头到尾对任意长度的序列进行操作。
52 Long-Short [!≈ lɒŋ ʃɔ:t]
  • Long-Short Term Memory (LSTM) [18, 11] is a type of RNN unit that is specially designed to address this problem. 长短时记忆（LSTM）[18,11]是一种专门设计用于解决这个问题的RNN单元。
53 multiplicative ['mʌltɪplɪkeɪtɪv]
  • An LSTM (illustrated in Fig. 3) consists of a memory cell and three multiplicative gates, namely the input, output and forget gates. LSTM（如图3所示）由一个存储单元和三个乘法门组成，即输入门、输出门和遗忘门。
54 Conceptually [kən'septʃʊəlɪ]
  • Conceptually, the memory cell stores the past contexts, and the input and output gates allow the cell to store contexts for a long period of time. 在概念上,存储单元存储过去的上下文,并且输入和输出门允许单元长时间地存储上下文。
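For reference, the cell and gates described in these sentences form the standard LSTM update. The paper does not restate the equations; the following is the common formulation, with $\sigma$ the sigmoid and $\odot$ the element-wise product:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```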
55 long-range [lɒŋ reɪndʒ]
  • The special design of LSTM allows it to capture long-range dependencies, which often occur in image-based sequences. LSTM的特殊设计允许它捕获长距离依赖,这经常发生在基于图像的序列中。
56 complementary [ˌkɒmplɪˈmentri]
  • However, in image-based sequences, contexts from both directions are useful and complementary to each other. 然而,在基于图像的序列中,两个方向的上下文是相互有用且互补的。
57 propagate [ˈprɒpəgeɪt]
  • In recurrent layers, error differentials are propagated in the opposite directions of the arrows shown in Fig. 3.b, i.e. Back-Propagation Through Time (BPTT). 在循环层中，误差微分沿图3.b所示箭头的相反方向传播，即随时间反向传播（BPTT）。
  • At the bottom of the recurrent layers, the sequence of propagated differentials is concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers. 在循环层的底部，传播的微分序列被连接成特征图，这反转了将特征图转换为特征序列的操作，然后反馈给卷积层。
58 BPTT [!≈ bi: pi: ti: ti:]
  • In recurrent layers, error differentials are propagated in the opposite directions of the arrows shown in Fig. 3.b, i.e. Back-Propagation Through Time (BPTT). 在循环层中，误差微分沿图3.b所示箭头的相反方向传播，即随时间反向传播（BPTT）。
  • In the recurrent layers, the Back-Propagation Through Time (BPTT) is applied to calculate the error differentials. 在循环层中，应用随时间反向传播（BPTT）来计算误差微分。
59 concatenate [kɒn'kætɪneɪt]
  • At the bottom of the recurrent layers, the sequence of propagated differentials is concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers. 在循环层的底部，传播的微分序列被连接成特征图，这反转了将特征图转换为特征序列的操作，然后反馈给卷积层。
60 invert [ɪnˈvɜ:t]
  • At the bottom of the recurrent layers, the sequence of propagated differentials is concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers. 在循环层的底部，传播的微分序列被连接成特征图，这反转了将特征图转换为特征序列的操作，然后反馈给卷积层。
61 Mathematically [ˌmæθə'mætɪklɪ]
  • Mathematically, transcription is to find the label sequence with the highest probability conditioned on the per-frame predictions. 数学上,转录是根据每帧预测找到具有最高概率的标签序列。
62 conditional [kənˈdɪʃənl]
  • We adopt the conditional probability defined in the Connectionist Temporal Classification (CTC) layer proposed by Graves et al. [15]. 我们采用Graves等人[15]提出的联接时间分类(CTC)层中定义的条件概率。
  • The formulation of the conditional probability is briefly described as follows: The input is a sequence $y = y_1,…,y_T$ where T is the sequence length. 条件概率的公式简要描述如下:输入是序列$y = y_1,…,y_T$,其中T是序列长度。
  • Then, the conditional probability is defined as the sum of probabilities of all $\boldsymbol{\pi}$ that are mapped by ${\cal B}$ onto $\mathbf{l}$: 然后,条件概率被定义为由${\cal B}$映射到$\mathbf{l}$上的所有$\boldsymbol{\pi}$的概率之和:
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Equation 1 for all sequences in the lexicon and choose the one with the highest probability. 基本上，通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列，即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。
  • The objective is to minimize the negative log-likelihood of conditional probability of ground truth: 目标是最小化真实条件概率的负对数似然:
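Assembled from the sentences above, the CTC conditional probability (Eq. 1 of the paper) reads:

```latex
p(\mathbf{l}|\mathbf{y})
  = \sum_{\boldsymbol{\pi}:\,{\cal B}(\boldsymbol{\pi})=\mathbf{l}}
      p(\boldsymbol{\pi}|\mathbf{y}),
\qquad
p(\boldsymbol{\pi}|\mathbf{y}) = \prod_{t=1}^{T} y_{\pi_{t}}^{t}.
```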
63 Connectionist [kə'nekʃənɪst]
  • We adopt the conditional probability defined in the Connectionist Temporal Classification (CTC) layer proposed by Graves et al. [15]. 我们采用Graves等人[15]提出的联接时间分类(CTC)层中定义的条件概率。
64 CTC [!≈ si: ti: si:]
  • We adopt the conditional probability defined in the Connectionist Temporal Classification (CTC) layer proposed by Graves et al. [15]. 我们采用Graves等人[15]提出的联接时间分类(CTC)层中定义的条件概率。
65 log-likelihood [!≈ lɒg ˈlaɪklihʊd]
  • Consequently, when we use the negative log-likelihood of this probability as the objective to train the network, we only need images and their corresponding label sequences, avoiding the labor of labeling positions of individual characters. 因此,当我们使用这种概率的负对数似然作为训练网络的目标函数时,我们只需要图像及其相应的标签序列,避免了标注单个字符位置的劳动。
  • The objective is to minimize the negative log-likelihood of conditional probability of ground truth: 目标是最小化真实条件概率的负对数似然:
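Written out, the objective these sentences describe (the notation ${\cal X}$ for the training set of image/label-sequence pairs is an assumption):

```latex
{\cal O} = -\sum_{(I_{i},\mathbf{l}_{i})\in{\cal X}} \log p(\mathbf{l}_{i}|\mathbf{y}_{i}),
```

where, as noted under "recurrent" above, $\mathbf{y}_{i}$ is the sequence produced by the recurrent and convolutional layers from $I_{i}$.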
66 hh-e-l-ll-oo
  • For example, ${\cal B}$ maps “–hh-e-l-ll-oo–” (’-’ represents ’blank’) onto “hello”. 例如，${\cal B}$将“–hh-e-l-ll-oo–”（’-’表示’blank’）映射到“hello”。
67 timestamp
  • where the probability of $\boldsymbol{\pi}$ is defined as $p(\boldsymbol{\pi}|\mathbf{y})=\prod_{t=1}^{T}y_{\pi_{t}}^{t}$, and $y_{\pi_{t}}^{t}$ is the probability of having label $\pi_{t}$ at timestamp $t$. $\boldsymbol{\pi}$的概率定义为$p(\boldsymbol{\pi}|\mathbf{y})=\prod_{t=1}^{T}y_{\pi_{t}}^{t}$，其中$y_{\pi_{t}}^{t}$是在时间戳$t$具有标签$\pi_{t}$的概率。
  • The sequence $\mathbf{l}^{*}$ is approximately found by $\mathbf{l}^{*}\approx{\cal B}(\arg\max_{\boldsymbol{\pi}}p(\boldsymbol{\pi}|\mathbf{y}))$, i.e. taking the most probable label $\pi_{t}$ at each timestamp $t$, and mapping the resulting sequence onto $\mathbf{l}^{*}$. 序列$\mathbf{l}^{*}$由$\mathbf{l}^{*}\approx{\cal B}(\arg\max_{\boldsymbol{\pi}}p(\boldsymbol{\pi}|\mathbf{y}))$近似得到，即在每个时间戳$t$取概率最大的标签$\pi_{t}$，并将得到的序列映射到$\mathbf{l}^{*}$。
68 Eq
  • Directly computing Eq.1 would be computationally infeasible due to the exponentially large number of summation items. 由于存在指数级数量的求和项,直接计算方程1在计算上是不可行的。
  • However, Eq.1 can be efficiently computed using the forward-backward algorithm described in [15]. 然而，使用[15]中描述的前向-后向算法可以有效地计算方程1。
  • In this mode, the sequence $\mathbf{l}^{*}$ that has the highest probability as defined in Eq.1 is taken as the prediction. 在这种模式下,将具有方程1中定义的最高概率的序列$\mathbf{l}^{*}$作为预测。
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Equation 1 for all sequences in the lexicon and choose the one with the highest probability. 基本上，通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列，即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。
69 computationally [!≈ ˌkɒmpjuˈteɪʃənli]
  • Directly computing Eq.1 would be computationally infeasible due to the exponentially large number of summation items. 由于存在指数级数量的求和项,直接计算方程1在计算上是不可行的。
70 infeasible [ɪn'fi:zəbl]
  • Directly computing Eq.1 would be computationally infeasible due to the exponentially large number of summation items. 由于存在指数级数量的求和项,直接计算方程1在计算上是不可行的。
71 exponentially [ˌekspə'nenʃəlɪ]
  • Directly computing Eq.1 would be computationally infeasible due to the exponentially large number of summation items. 由于存在指数级数量的求和项,直接计算方程1在计算上是不可行的。
72 summation [sʌˈmeɪʃn]
  • Directly computing Eq.1 would be computationally infeasible due to the exponentially large number of summation items. 由于存在指数级数量的求和项,直接计算方程1在计算上是不可行的。
73 forward-backward [!≈ ˈfɔ:wəd ˈbækwəd]
  • However, Eq.1 can be efficiently computed using the forward-backward algorithm described in [15]. 然而，使用[15]中描述的前向-后向算法可以有效地计算方程1。
  • In particular, in the transcription layer, error differentials are back-propagated with the forward-backward algorithm, as described in [15]. 特别地，在转录层中，如[15]所述，误差微分使用前向-后向算法进行反向传播。
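The forward half of that algorithm is short enough to sketch. A minimal numpy version for computing p(l|y) from Eq. 1 without enumerating all paths; it assumes a non-empty label and 'blank' at index 0 (gradients additionally need the backward, beta, recursion, omitted here):

```python
import numpy as np

def ctc_forward(probs, label, blank=0):
    """Forward (alpha) recursion of the forward-backward algorithm.
    probs: (T, num_labels) per-frame distributions; label: list of ints."""
    ext = [blank]                       # extend label with blanks: -l1-l2-...-
    for k in label:
        ext += [k, blank]
    T, S = probs.shape[0], len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]
    alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                          # stay on same symbol
            if s > 0:
                a += alpha[t - 1, s - 1]                 # advance by one
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]                 # skip a blank
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]
```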
74 tractable [ˈtræktəbl]
  • Since there exists no tractable algorithm to precisely find the solution, we use the strategy adopted in [15]. 由于不存在用于精确找到解的可行方法,我们采用[15]中的策略。
75 k-word [!≈ keɪ wɜ:d]
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Equation 1 for all sequences in the lexicon and choose the one with the highest probability. 基本上，通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列，即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。
  • Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT). 有四个流行的基准数据集用于场景文本识别的性能评估,即ICDAR 2003(IC03),ICDAR 2013(IC13),IIIT 5k-word(IIIT5k)和Street View Text (SVT)。
  • Each image has been associated with a 50-words lexicon and a 1k-words lexicon. 每张图像关联一个50词的词典和一个1000词的词典。
76 Hunspell
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Equation 1 for all sequences in the lexicon and choose the one with the highest probability. 基本上，通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列，即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。
  • In addition, we use a 50k words lexicon consisting of the words in the Hunspell spell-checking dictionary [1]. 此外,我们使用由Hunspell拼写检查字典[1]中的单词组成的5万个词的词典。
77 spell-checking [!≈ spel 'tʃekɪŋ]
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Equation 1 for all sequences in the lexicon and choose the one with the highest probability. 基本上，通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列，即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。
  • In addition, we use a 50k words lexicon consisting of the words in the Hunspell spell-checking dictionary [1]. 此外,我们使用由Hunspell拼写检查字典[1]中的单词组成的5万个词的词典。
78 time-consuming [taɪm kən'sju:mɪŋ]
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Equation 1 for all sequences in the lexicon and choose the one with the highest probability. 基本上，通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列，即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。
79 exhaustive [ɪgˈzɔ:stɪv]
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Equation 1 for all sequences in the lexicon and choose the one with the highest probability. 基本上，通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列，即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。
80 nearest-neighbor ['nɪərɪstn'eɪbɔ:]
  • This indicates that we can limit our search to the nearest-neighbor candidates ${\cal N}_{\delta}(\mathbf{l}')$, where $\delta$ is the maximal edit distance and $\mathbf{l}'$ is the sequence transcribed from $\mathbf{y}$ in lexicon-free mode: 这表示我们可以将搜索限制在最近邻候选目标${\cal N}_{\delta}(\mathbf{l}')$内，其中$\delta$是最大编辑距离，$\mathbf{l}'$是在无词典模式下从$\mathbf{y}$转录的序列：
81 maximal [ˈmæksɪml]
  • This indicates that we can limit our search to the nearest-neighbor candidates ${\cal N}_{\delta}(\mathbf{l}')$, where $\delta$ is the maximal edit distance and $\mathbf{l}'$ is the sequence transcribed from $\mathbf{y}$ in lexicon-free mode: 这表示我们可以将搜索限制在最近邻候选目标${\cal N}_{\delta}(\mathbf{l}')$内，其中$\delta$是最大编辑距离，$\mathbf{l}'$是在无词典模式下从$\mathbf{y}$转录的序列：
82 transcribe [trænˈskraɪb]
  • This indicates that we can limit our search to the nearest-neighbor candidates ${\cal N}_{\delta}(\mathbf{l}')$, where $\delta$ is the maximal edit distance and $\mathbf{l}'$ is the sequence transcribed from $\mathbf{y}$ in lexicon-free mode: 这表示我们可以将搜索限制在最近邻候选目标${\cal N}_{\delta}(\mathbf{l}')$内，其中$\delta$是最大编辑距离，$\mathbf{l}'$是在无词典模式下从$\mathbf{y}$转录的序列：
83 BK-tree
  • The candidates ${\cal N}_{\delta}(\mathbf{l}')$ can be found efficiently with the BK-tree data structure [9], which is a metric tree specifically adapted to discrete metric spaces. 可以使用BK树数据结构[9]高效地找到候选目标${\cal N}_{\delta}(\mathbf{l}')$，这是一种专门适用于离散度量空间的度量树。
  • The search time complexity of BK-tree is $O(\log|{\cal D}|)$, where $|{\cal D}|$ is the lexicon size. BK树的搜索时间复杂度为$O(\log|{\cal D}|)$,其中$|{\cal D}|$是词典大小。
  • In our approach, a BK-tree is constructed offline for a lexicon. 在我们的方法中，预先为词典离线构建一个BK树。
  • We implement the network within the Torch7 [10] framework, with custom implementations for the LSTM units (in Torch7/CUDA), the transcription layer (in C++) and the BK-tree data structure (in C++). 我们在Torch7[10]框架内实现了网络,使用定制实现的LSTM单元(Torch7/CUDA),转录层(C++)和BK树数据结构(C++)。
  • On the other hand, the computational cost grows with larger $\delta$, due to longer BK-tree search time, as well as a larger number of candidate sequences for testing. 另一方面，由于更长的BK树搜索时间，以及更多的用于测试的候选序列，计算成本随$\delta$的增大而增加。
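A minimal sketch of that data structure (illustrative, not the paper's C++ implementation). `dist` is any discrete metric, e.g. the `edit_distance` sketched earlier; the triangle inequality lets a query within $\delta$ prune children whose edge key lies outside $[d-\delta, d+\delta]$:

```python
class BKTree:
    """BK-tree: a metric tree for discrete metric spaces. Each child edge
    is keyed by the child's distance to its parent."""
    def __init__(self, dist):
        self.dist, self.root = dist, None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node, children = self.root
        while True:
            d = self.dist(word, node)
            if d not in children:
                children[d] = (word, {})
                return
            node, children = children[d]

    def query(self, word, delta):
        """All stored words within distance delta of `word`."""
        out, stack = [], [self.root] if self.root else []
        while stack:
            node, children = stack.pop()
            d = self.dist(word, node)
            if d <= delta:
                out.append(node)
            for k, child in children.items():
                if d - delta <= k <= d + delta:   # triangle-inequality prune
                    stack.append(child)
        return out
```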
84 stochastic [stə'kæstɪk]
  • The network is trained with stochastic gradient descent (SGD). 网络使用随机梯度下降(SGD)进行训练。
85 descent [dɪˈsent]
  • The network is trained with stochastic gradient descent (SGD). 网络使用随机梯度下降(SGD)进行训练。
86 SGD ['esdʒ'i:d'i:]
  • The network is trained with stochastic gradient descent (SGD). 网络使用随机梯度下降(SGD)进行训练。
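For reference, the SGD update itself is a one-liner per parameter; a toy sketch, not the paper's training loop:

```python
def sgd_step(params, grads, lr=0.01):
    """Vanilla SGD: move each parameter against its (mini-batch) gradient."""
    return [w - lr * g for w, g in zip(params, grads)]
```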
87 back-propagated [!≈ bæk ˈprɔpəɡeitid]
  • In particular, in the transcription layer, error differentials are back-propagated with the forward-backward algorithm, as described in [15]. 特别地，在转录层中，如[15]所述，误差使用前向-后向算法进行反向传播。
88 ADADELTA [!≈ eɪ di: eɪ di: i: el ti: eɪ]
  • For optimization, we use the ADADELTA [37] to automatically calculate per-dimension learning rates. 为了优化,我们使用ADADELTA[37]自动计算每维的学习率。
  • Compared with the conventional momentum [31] method, ADADELTA requires no manual setting of a learning rate. 与传统的动量[31]方法相比,ADADELTA不需要手动设置学习率。
  • More importantly, we find that optimization using ADADELTA converges faster than the momentum method. 更重要的是,我们发现使用ADADELTA的优化收敛速度比动量方法快。
  • Networks are trained with ADADELTA, setting the parameter ρ to 0.9. 网络用ADADELTA训练,将参数ρ设置为0.9。
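The "no manual learning rate" property follows from the update rule itself: ADADELTA rescales each gradient by running averages of past squared gradients and past squared updates. A per-parameter sketch with ρ = 0.9 as above (ε is a small smoothing constant whose value here is an assumption):

```python
import math

def adadelta_step(w, g, state, rho=0.9, eps=1e-6):
    """One ADADELTA [37] update for a scalar parameter; no learning rate needed."""
    state["Eg2"] = rho * state["Eg2"] + (1 - rho) * g * g            # E[g^2]
    delta = -math.sqrt(state["Edx2"] + eps) / math.sqrt(state["Eg2"] + eps) * g
    state["Edx2"] = rho * state["Edx2"] + (1 - rho) * delta * delta  # E[dx^2]
    return w + delta

state = {"Eg2": 0.0, "Edx2": 0.0}
w = adadelta_step(1.0, g=0.5, state=state)   # per-dimension rate is implicit
```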
89 momentum [məˈmentəm]
  • Compared with the conventional momentum [31] method, ADADELTA requires no manual setting of a learning rate. 与传统的动量[31]方法相比,ADADELTA不需要手动设置学习率。
  • More importantly, we find that optimization using ADADELTA converges faster than the momentum method. 更重要的是,我们发现使用ADADELTA的优化收敛速度比动量方法快。
90 synthetic [sɪnˈθetɪk]
  • For all the experiments for scene text recognition, we use the synthetic dataset (Synth) released by Jaderberg et al. [20] as the training data. 对于场景文本识别的所有实验,我们使用Jaderberg等人[20]发布的合成数据集(Synth)作为训练数据。
  • Such images are generated by a synthetic text engine and are highly realistic. 这样的图像由合成文本引擎生成并且非常逼真。
  • Our network is trained on the synthetic data once, and tested on all other real-world test datasets without any fine-tuning on their training data. 我们的网络在合成数据上进行了一次训练,并在所有其它现实世界的测试数据集上进行了测试,而没有在其训练数据上进行任何微调。
  • Even though the CRNN model is purely trained with synthetic text data, it works well on real images from standard text recognition benchmarks. 即使CRNN模型是在纯合成文本数据上训练,但它在标准文本识别基准数据集的真实图像上工作良好。
  • Our method uses only synthetic text with word level labels as the training data, very different to PhotoOCR [8] which used 7.9 millions of real word images with character-level annotations for training. 我们的方法只使用具有单词级标签的合成文本作为训练数据,与PhotoOCR[8]非常不同,后者使用790万个具有字符级标注的真实单词图像进行训练。
91 Synth [sɪnθ]
  • For all the experiments for scene text recognition, we use the synthetic dataset (Synth) released by Jaderberg et al. [20] as the training data. 对于场景文本识别的所有实验,我们使用Jaderberg等人[20]发布的合成数据集(Synth)作为训练数据。
92 Jaderberg
  • For all the experiments for scene text recognition, we use the synthetic dataset (Synth) released by Jaderberg et al. [20] as the training data. 对于场景文本识别的所有实验,我们使用Jaderberg等人[20]发布的合成数据集(Synth)作为训练数据。
93 IC03
  • Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT). 有四个流行的基准数据集用于场景文本识别的性能评估,即ICDAR 2003(IC03),ICDAR 2013(IC13),IIIT 5k-word(IIIT5k)和Street View Text (SVT)。
  • IC03 [27] test dataset contains 251 scene images with labeled text bounding boxes. IC03[27]测试数据集包含251个具有标记文本边界框的场景图像。
  • IC13 [24] test dataset inherits most of its data from IC03. IC13[24]测试数据集继承了IC03中的大部分数据。
  • The average testing time is 0.16s/sample, as measured on IC03 without a lexicon. 平均测试时间为0.16s/样本，在IC03上不使用词典测得。
  • The approximate lexicon search is applied to the 50k lexicon of IC03, with the parameter δ set to 3. 近似词典搜索应用于IC03的50k词典,参数δ设置为3。
  • Specifically, we obtain superior performance on IIIT5k, and SVT compared to [22], only achieved lower performance on IC03 with the “Full” lexicon. 具体来说,与[22]相比,我们在IIIT5k和SVT上获得了卓越的性能,仅在IC03上通过“Full”词典实现了较低性能。
  • In the unconstrained lexicon cases, our method achieves the best performance on SVT, yet, is still behind some approaches [8, 22] on IC03 and IC13. 在无约束词典的情况下,我们的方法在SVT上仍取得了最佳性能,但在IC03和IC13上仍然落后于一些方法[8,22]。
  • Tested on the IC03 dataset with the 50k lexicon. 在IC03数据集上使用50k词典进行的测试。
94 IC13
  • Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT). 有四个流行的基准数据集用于场景文本识别的性能评估,即ICDAR 2003(IC03),ICDAR 2013(IC13),IIIT 5k-word(IIIT5k)和Street View Text (SVT)。
  • IC13 [24] test dataset inherits most of its data from IC03. IC13[24]测试数据集继承了IC03中的大部分数据。
  • In the unconstrained lexicon cases, our method achieves the best performance on SVT, yet, is still behind some approaches [8, 22] on IC03 and IC13. 在无约束词典的情况下,我们的方法在SVT上仍取得了最佳性能,但在IC03和IC13上仍然落后于一些方法[8,22]。
95 IIIT [!≈ aɪ aɪ aɪ ti:]
  • Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT). 有四个流行的基准数据集用于场景文本识别的性能评估,即ICDAR 2003(IC03),ICDAR 2013(IC13),IIIT 5k-word(IIIT5k)和Street View Text (SVT)。
96 IIIT5k
  • Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT). 有四个流行的基准数据集用于场景文本识别的性能评估,即ICDAR 2003(IC03),ICDAR 2013(IC13),IIIT 5k-word(IIIT5k)和Street View Text (SVT)。
  • IIIT5k [28] contains 3,000 cropped word test images collected from the Internet. IIIT5k[28]包含从互联网收集的3000张裁剪的词测试图像。
  • Specifically, we obtain superior performance on IIIT5k, and SVT compared to [22], only achieved lower performance on IC03 with the “Full” lexicon. 具体来说,与[22]相比,我们在IIIT5k和SVT上获得了卓越的性能,仅在IC03上通过“Full”词典实现了较低性能。
97 SVT [!≈ es vi: ti:]
  • Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT). 有四个流行的基准数据集用于场景文本识别的性能评估,即ICDAR 2003(IC03),ICDAR 2013(IC13),IIIT 5k-word(IIIT5k)和Street View Text (SVT)。
  • SVT [34] test dataset consists of 249 street view images collected from Google Street View. SVT[34]测试数据集由从Google街景视图收集的249张街景图像组成。
  • Specifically, we obtain superior performance on IIIT5k, and SVT compared to [22], only achieved lower performance on IC03 with the “Full” lexicon. 具体来说,与[22]相比,我们在IIIT5k和SVT上获得了卓越的性能,仅在IC03上通过“Full”词典实现了较低性能。
  • In the unconstrained lexicon cases, our method achieves the best performance on SVT, yet, is still behind some approaches [8, 22] on IC03 and IC13. 在无约束词典的情况下,我们的方法在SVT上仍取得了最佳性能,但在IC03和IC13上仍然落后于一些方法[8,22]。
98 bounding [baundɪŋ]
  • IC03 [27] test dataset contains 251 scene images with labeled text bounding boxes. IC03[27]测试数据集包含251个具有标记文本边界框的场景图像。
  • Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions). 比较的属性包括:1)端到端训练(E2E Train);2)从图像中直接学习卷积特征而不是使用手动设计的特征(Conv Ftrs);3)训练期间不需要字符的实际边界框(CharGT-Free);4)不受限于预定义字典(Unconstrained);5)模型大小(如果使用端到端模型),通过模型参数数量来衡量(Model Size, M表示百万)。
99 non-alphanumeric [!≈ nɒn ˌælfənju:ˈmerɪk]
  • Following Wang et al. [34], we ignore images that either contain non-alphanumeric characters or have fewer than three characters, and get a test set with 860 cropped text images. 按照Wang等人[34]的做法，我们忽略包含非字母数字字符或少于三个字符的图像，得到一个具有860张裁剪文本图像的测试集。
100 VGG-VeryDeep
  • The architecture of the convolutional layers is based on the VGG-VeryDeep architectures [32]. 卷积层的架构是基于VGG-VeryDeep的架构[32]。
101 tweak [twi:k]
  • A tweak is made in order to make it suitable for recognizing English texts. 为了使其适用于识别英文文本,对其进行了调整。
  • This tweak yields feature maps with larger width, hence longer feature sequence. 这种调整产生宽度较大的特征图,因此具有更长的特征序列。
102 rectangular [rek'tæŋɡjələ(r)]
  • In the 3rd and the 4th max-pooling layers, we adopt 1 × 2 sized rectangular pooling windows instead of the conventional square ones. 在第3和第4个最大池化层中，我们采用1×2大小的矩形池化窗口而不是传统的正方形窗口。
  • On top of that, the rectangular pooling windows yield rectangular receptive fields (illustrated in Fig. 2), which are beneficial for recognizing some characters that have narrow shapes, such as ’i’ and ’l’. 最重要的是,矩形池窗口产生矩形感受野(如图2所示),这有助于识别一些具有窄形状的字符,例如i和l。
  • On top of that, the rectangular pooling windows yield rectangular receptive fields (illustrated in Fig. 2), which are beneficial for recognizing some characters that have narrow shapes, such as ’i’ and ’l’. 最重要的是,矩形池窗口产生矩形感受野(如图2所示),这有助于识别一些具有窄形状的字符,例如i和l。
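The "larger width, hence longer feature sequence" effect is pure shape arithmetic. A back-of-the-envelope sketch (the kernel orientations and the 32-pixel input height are assumptions for illustration, not the exact configuration table from the paper):

```python
def pool(h, w, kh, kw):
    """Max-pooling output size for kernel (kh, kw) with matching stride."""
    return h // kh, w // kw

h, w = 32, 100              # input normalized to height 32
h, w = pool(h, w, 2, 2)     # 1st pooling, square: 16 x 50
h, w = pool(h, w, 2, 2)     # 2nd pooling, square: 8 x 25
h, w = pool(h, w, 2, 1)     # 3rd pooling, rectangular: halves height only
h, w = pool(h, w, 2, 1)     # 4th pooling, rectangular: halves height only
print(h, w)                 # 2 x 25 -> 25 frames; four square poolings
                            # would have left only 100 // 16 = 6 frames
```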
103 Xeon
  • Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU. 实验在具有2.50 GHz Intel(R)Xeon E5-2609 CPU,64GB RAM和NVIDIA(R)Tesla(TM) K40 GPU的工作站上进行。
104 RAM [ræm]
  • Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU. 实验在具有2.50 GHz Intel(R)Xeon E5-2609 CPU,64GB RAM和NVIDIA(R)Tesla(TM) K40 GPU的工作站上进行。
  • Our model has 8.3 million parameters, taking only 33MB RAM (using 4-bytes single-precision float for each parameter), thus it can be easily ported to mobile devices. 我们的模型有830万个参数,只有33MB RAM(每个参数使用4字节单精度浮点数),因此可以轻松地移植到移动设备上。
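The 33MB figure is just the parameter count times the float width; checking the arithmetic:

```python
params = 8.3e6                 # reported parameter count
size_mb = params * 4 / 1e6     # 4-byte single-precision floats -> megabytes
print(f"{size_mb:.1f} MB")     # 33.2 MB, matching the reported ~33MB
```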
105 NVIDIA [ɪn'vɪdɪə]
  • Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU. 实验在具有2.50 GHz Intel(R)Xeon E5-2609 CPU,64GB RAM和NVIDIA(R)Tesla(TM) K40 GPU的工作站上进行。
106 Tesla ['teslә]
  • Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU. 实验在具有2.50 GHz Intel(R)Xeon E5-2609 CPU,64GB RAM和NVIDIA(R)Tesla(TM) K40 GPU的工作站上进行。
107 TM [!≈ ti: em]
  • Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU. 实验在具有2.50 GHz Intel(R)Xeon E5-2609 CPU,64GB RAM和NVIDIA(R)Tesla(TM) K40 GPU的工作站上进行。
108 K40
  • Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU. 实验在具有2.50 GHz Intel(R)Xeon E5-2609 CPU,64GB RAM和NVIDIA(R)Tesla(TM) K40 GPU的工作站上进行。
109 convergence [kən'vɜ:dʒəns]
  • The training process takes about 50 hours to reach convergence. 训练过程大约需要50个小时才能达到收敛。
110 proportionally [prə'pɔ:ʃənlɪ]
  • Widths are proportionally scaled with heights, but at least 100 pixels. 宽度与高度成比例地缩放,但至少为100像素。
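A sketch of that resizing rule (the 32-pixel target height is CRNN's usual test-time setting, assumed here):

```python
def resize_dims(h, w, target_h=32, min_w=100):
    """Scale width proportionally with height, but keep at least 100 pixels."""
    return target_h, max(min_w, round(w * target_h / h))

print(resize_dims(64, 320))   # (32, 160): plain proportional scaling
print(resize_dims(48, 60))    # (32, 100): width clamped to the minimum
```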
111 Comparative [kəmˈpærətɪv]
  • 3.3. Comparative Evaluation 3.3. 比较评估
112 consistently [kən'sɪstəntlɪ]
  • In the constrained lexicon cases, our method consistently outperforms most state-of-the-art approaches, and on average beats the best text reader proposed in [22]. 在有约束词典的情况中，我们的方法始终优于大多数最新的方法，并且平均打败了[22]中提出的最佳文本阅读器。
113 PhotoOCR
  • Our method uses only synthetic text with word level labels as the training data, very different to PhotoOCR [8] which used 7.9 millions of real word images with character-level annotations for training. 我们的方法只使用具有单词级标签的合成文本作为训练数据,与PhotoOCR[8]非常不同,后者使用790万个具有字符级标注的真实单词图像进行训练。
114 character-level [!≈ ˈkærəktə(r) ˈlevl]
  • Our method uses only synthetic text with word level labels as the training data, very different to PhotoOCR [8] which used 7.9 millions of real word images with character-level annotations for training. 我们的方法只使用具有单词级标签的合成文本作为训练数据,与PhotoOCR[8]非常不同,后者使用790万个具有字符级标注的真实单词图像进行训练。
  • CharGT-Free: This column is to indicate whether the character-level annotations are essential for training the model. CharGT-Free:这一列用来表明字符级标注对于训练模型是否是必要的。
  • As the input and output labels of CRNN can be a sequence, character-level annotations are not necessary. 由于CRNN的输入和输出标签是序列,因此字符级标注是不必要的。
115 performance [pəˈfɔ:məns]
  • The best performance is reported by [22] in the unconstrained lexicon cases, benefiting from its large dictionary; however, it is not a model strictly unconstrained to a lexicon as mentioned before. [22]中报告的最佳性能是在无约束词典的情况下，受益于它的大字典；然而，它不是前面提到的严格的无约束词典模型。
116 E2E
  • For further understanding the advantages of the proposed algorithm over other text recognition approaches, we provide a comprehensive comparison on several properties named E2E Train, Conv Ftrs, CharGT-Free, Unconstrained, and Model Size, as summarized in Table 3. 为了进一步了解与其它文本识别方法相比,所提出算法的优点,我们提供了在一些特性上的综合比较,这些特性名称为E2E Train,Conv Ftrs,CharGT-Free,Unconstrained和Model Size,如表3所示。
  • Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions). 比较的属性包括:1)端到端训练(E2E Train);2)从图像中直接学习卷积特征而不是使用手动设计的特征(Conv Ftrs);3)训练期间不需要字符的实际边界框(CharGT-Free);4)不受限于预定义字典(Unconstrained);5)模型大小(如果使用端到端模型),通过模型参数数量来衡量(Model Size, M表示百万)。
  • E2E Train: This column is to show whether a certain text reading model is end-to-end trainable, without any preprocess or through several separated steps, which indicates such approaches are elegant and clean for training. E2E Train:这一列是为了显示某种文字阅读模型是否可以进行端到端的训练,无需任何预处理或经过几个分离的步骤,这表明这种方法对于训练是优雅且干净的。
117 Ftrs
  • For further understanding the advantages of the proposed algorithm over other text recognition approaches, we provide a comprehensive comparison on several properties named E2E Train, Conv Ftrs, CharGT-Free, Unconstrained, and Model Size, as summarized in Table 3. 为了进一步了解与其它文本识别方法相比,所提出算法的优点,我们提供了在一些特性上的综合比较,这些特性名称为E2E Train,Conv Ftrs,CharGT-Free,Unconstrained和Model Size,如表3所示。
  • Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions). 比较的属性包括:1)端到端训练(E2E Train);2)从图像中直接学习卷积特征而不是使用手动设计的特征(Conv Ftrs);3)训练期间不需要字符的实际边界框(CharGT-Free);4)不受限于预定义字典(Unconstrained);5)模型大小(如果使用端到端模型),通过模型参数数量来衡量(Model Size, M表示百万)。
  • Conv Ftrs: This column is to indicate whether an approach uses the convolutional features learned from training images directly or handcraft features as the basic representations. Conv Ftrs:这一列用来表明一个方法是否使用从训练图像直接学习到的卷积特征或手动特征作为基本的表示。
118 CharGT-Free
  • For further understanding the advantages of the proposed algorithm over other text recognition approaches, we provide a comprehensive comparison on several properties named E2E Train, Conv Ftrs, CharGT-Free, Unconstrained, and Model Size, as summarized in Table 3. 为了进一步了解与其它文本识别方法相比,所提出算法的优点,我们提供了在一些特性上的综合比较,这些特性名称为E2E Train,Conv Ftrs,CharGT-Free,Unconstrained和Model Size,如表3所示。
  • Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions). 比较的属性包括:1)端到端训练(E2E Train);2)从图像中直接学习卷积特征而不是使用手动设计的特征(Conv Ftrs);3)训练期间不需要字符的实际边界框(CharGT-Free);4)不受限于预定义字典(Unconstrained);5)模型大小(如果使用端到端模型),通过模型参数数量来衡量(Model Size, M表示百万)。
  • CharGT-Free: This column is to indicate whether the character-level annotations are essential for training the model. CharGT-Free:这一列用来表明字符级标注对于训练模型是否是必要的。
119 hand-crafted [,hænd 'kra:ftid]
  • Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions). 比较的属性包括:1)端到端训练(E2E Train);2)从图像中直接学习卷积特征而不是使用手动设计的特征(Conv Ftrs);3)训练期间不需要字符的实际边界框(CharGT-Free);4)不受限于预定义字典(Unconstrained);5)模型大小(如果使用端到端模型),通过模型参数数量来衡量(Model Size, M表示百万)。
120 handcraft [ˈhændkrɑ:ft]
  • Conv Ftrs: This column is to indicate whether an approach uses the convolutional features learned from training images directly or handcraft features as the basic representations. Conv Ftrs:这一列用来表明一个方法是否使用从训练图像直接学习到的卷积特征或手动特征作为基本的表示。
121 incremental [ˌɪŋkrə'mentl]
  • Notice that though the recent models learned by label embedding [5, 14] and incremental learning [22] achieved highly competitive performance, they are constrained to a specific dictionary. 注意尽管最近通过标签嵌入[5, 14]和增强学习[22]学习到的模型取得了非常有竞争力的性能,但它们受限于一个特定的字典。
122 weight-sharing [!≈ weɪt 'ʃeərɪŋ]
  • In CRNN, all layers have weight-sharing connections, and the fully-connected layers are not needed. 在CRNN中,所有的层有权重共享连接,不需要全连接层。
123 variant [ˈveəriənt]
  • Consequently, the number of parameters of CRNN is much less than the models learned on the variants of CNN [22, 21], resulting in a much smaller model compared with [22, 21]. 因此,CRNN的参数数量远小于CNN变体[22,21]所得到的模型,导致与[22,21]相比,模型要小得多。
124 MB [!≈ em bi:]
  • Our model has 8.3 million parameters, taking only 33MB RAM (using 4-bytes single-precision float for each parameter), thus it can be easily ported to mobile devices. 我们的模型有830万个参数,只有33MB RAM(每个参数使用4字节单精度浮点数),因此可以轻松地移植到移动设备上。
125 Eq.2.
  • In addition, to test the impact of the parameter $\delta$, we experiment with different values of $\delta$ in Eq.2. 另外，为了测试参数$\delta$的影响，我们在方程2中实验了$\delta$的不同值。
126 tradeoff ['treɪdˌɔ:f]
  • In practice, we choose $\delta=3$ as a tradeoff between accuracy and speed. 实际上，我们选择$\delta=3$作为精度和速度之间的折衷。
127 OMR [!≈ əu em ɑ:(r)]
  • Recognizing musical scores in images is known as the Optical Music Recognition (OMR) problem. 识别图像中的乐谱被称为光学音乐识别(OMR)问题。
  • We cast the OMR as a sequence recognition problem, and predict a sequence of musical notes directly from the image with CRNN. 我们将OMR作为序列识别问题,直接用CRNN从图像中预测音符的序列。
  • For comparison, we evaluate two commercial OMR engines, namely the Capella Scan [3] and the PhotoScore [4]. 为了比较,我们评估了两种商用OMR引擎,即Capella Scan[3]和PhotoScore[4]。
  • Comparison of pitch recognition accuracies, among CRNN and two commercial OMR systems, on the three datasets we have collected. 在我们收集的数据集上,CRNN和两个商业OMR系统对音调识别准确率的对比。
  • But it provides a new scheme for OMR, and has shown promising capabilities in pitch recognition. 但它为OMR提供了一个新的方案,并且在音高识别方面表现出有前途的能力。
  • In addition, CRNN significantly outperforms other competitors on a benchmark for Optical Music Recognition (OMR), which verifies the generality of CRNN. 此外,CRNN在光学音乐识别(OMR)的基准数据集上显著优于其它的竞争者,这验证了CRNN的泛化性。
128 binarization [ˌbaɪnəraɪ'zeɪʃn]
  • Previous methods often require image preprocessing (mostly binarization), staff lines detection and individual notes recognition [29]. 以前的方法通常需要图像预处理（主要是二值化），五线谱检测和单个音符识别[29]。
129 pitch [pɪtʃ]
  • For simplicity, we recognize pitches only, ignore all chords and assume the same major scales (C major) for all scores. 为了简单起见,我们仅认识音调,忽略所有和弦,并假定所有乐谱具有相同的大调音阶(C大调)。
  • To the best of our knowledge, there exist no public datasets for evaluating algorithms on pitch recognition. 据我们所知，没有用于评估音调识别算法的公共数据集。
  • Two measures are used for evaluating the recognition performance: 1) fragment accuracy, i.e. the percentage of score fragments correctly recognized; 2) average edit distance, i.e. the average edit distance between predicted pitch sequences and the ground truths. 使用两种方法来评估识别性能:1)片段准确度,即正确识别的乐谱片段的百分比;2)平均编辑距离,即预测音调序列与真实值之间的平均编辑距离。
  • Comparison of pitch recognition accuracies, among CRNN and two commercial OMR systems, on the three datasets we have collected. 在我们收集的数据集上,CRNN和两个商业OMR系统对音调识别准确率的对比。
  • But it provides a new scheme for OMR, and has shown promising capabilities in pitch recognition. 但它为OMR提供了一个新的方案,并且在音高识别方面表现出有前途的能力。
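Both evaluation measures are easy to state in code; a sketch assuming the `edit_distance` helper defined earlier (sequences of pitches work there as well as strings):

```python
def fragment_accuracy(preds, truths):
    """Percentage of score fragments recognized exactly."""
    return 100.0 * sum(p == t for p, t in zip(preds, truths)) / len(truths)

def avg_edit_distance(preds, truths):
    """Mean edit distance between predicted and ground-truth pitch sequences."""
    return sum(edit_distance(p, t) for p, t in zip(preds, truths)) / len(truths)
```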
130 pitches [ˈpɪtʃɪz]
  • We manually label the ground truth label sequences (sequences of note pitches) for all the images. 我们为所有图像手动标注了真实标签序列（音符音高的序列）。
131 augment [ɔ:gˈment]
  • The collected images are augmented to 265k training samples by being rotated, scaled and corrupted with noise, and by replacing their backgrounds with natural images. 收集到的图像通过旋转,缩放和用噪声损坏增强到了265k个训练样本,并用自然图像替换它们的背景。
132 synthesize [ˈsɪnθəsaɪz]
  • Examples are shown in Fig. 5.a; 2) “Synthesized”, which is created from “Clean”, using the augmentation strategy mentioned above. 实例如图5.a所示；2)“合成的”，由“纯净的”创建，使用了上述的增强策略。
  • Figure 5. (a) Clean musical scores images collected from [2] (b) Synthesized musical score images. 图5。(a)从[2]中收集的干净的乐谱图像。(b)合成的乐谱图像。
  • The CRNN outperforms the two commercial systems by a large margin. The Capella Scan and PhotoScore systems perform reasonably well on the Clean dataset, but their performances drop significantly on synthesized and real-world data. CRNN大幅度优于这两个商业系统。Capella Scan和PhotoScore系统在干净的数据集上表现相当不错，但是它们的性能在合成和现实世界数据上显著下降。
  • The main reason is that they rely on robust binarization to detect staff lines and notes, but the binarization step often fails on synthesized and real-world data due to bad lighting condition, noise corruption and cluttered background. 主要原因是它们依赖于鲁棒的二值化来检测五线谱和音符，但是由于光线不良，噪音破坏和杂乱的背景，二值化步骤经常会在合成数据和现实数据上失败。
133 augmentation [ˌɔ:ɡmen'teɪʃn]
  • Examples are shown in Fig. 5.a; 2) “Synthesized”, which is created from “Clean”, using the augmentation strategy mentioned above. 实例如图5.a所示；2)“合成的”，由“纯净的”创建，使用了上述的增强策略。
134 Tab [tæb]
  • Different from the configuration specified in Tab. 1, the 4th and 6th convolution layers are removed, and the 2-layer bidirectional LSTM is replaced by a 2-layer single directional LSTM. 与表1中指定的配置不同,我们移除了第4和第6卷积层,将2层双向LSTM替换为2层单向LSTM。
  • Tab. 4 summarizes the results. 表4总结了结果。
135 Capella [kəˈpelə]
  • For comparison, we evaluate two commercial OMR engines, namely the Capella Scan [3] and the PhotoScore [4]. 为了比较,我们评估了两种商用OMR引擎,即Capella Scan[3]和PhotoScore[4]。
  • The CRNN outperforms the two commercial systems by a large margin. The Capella Scan and PhotoScore systems perform reasonably well on the Clean dataset, but their performances drop significantly on synthesized and real-world data. CRNN大幅度优于这两个商业系统。Capella Scan和PhotoScore系统在干净的数据集上表现相当不错，但是它们的性能在合成和现实世界数据上显著下降。
  • Compared with Capella Scan and PhotoScore, our CRNN-based system is still preliminary and misses many functionalities. 与Capella Scan和PhotoScore相比,我们的基于CRNN的系统仍然是初步的,并且缺少许多功能。
136 PhotoScore
  • For comparison, we evaluate two commercial OMR engines, namely the Capella Scan [3] and the PhotoScore [4]. 为了比较,我们评估了两种商用OMR引擎,即Capella Scan[3]和PhotoScore[4]。
  • The CRNN outperforms the two commercial systems by a large margin. The Capella Scan and PhotoScore systems perform reasonably well on the Clean dataset, but their performances drop significantly on synthesized and real-world data. CRNN大幅度优于这两个商业系统。Capella Scan和PhotoScore系统在干净的数据集上表现相当不错，但是它们的性能在合成和现实世界数据上显著下降。
  • Compared with Capella Scan and PhotoScore, our CRNN-based system is still preliminary and misses many functionalities. 与Capella Scan和PhotoScore相比,我们的基于CRNN的系统仍然是初步的,并且缺少许多功能。
137 clutter [ˈklʌtə(r)]
  • The main reason is that they rely on robust binarization to detect staff lines and notes, but the binarization step often fails on synthesized and real-world data due to bad lighting condition, noise corruption and cluttered background. 主要原因是它们依赖于鲁棒的二值化来检测五线谱和音符，但是由于光线不良，噪音破坏和杂乱的背景，二值化步骤经常会在合成数据和现实数据上失败。
138 minimal [ˈmɪnɪməl]
  • The results have shown the generality of CRNN, in that it can be readily applied to other image-based sequence recognition problems, requiring minimal domain knowledge. 结果显示了CRNN的泛化性,因为它可以很容易地应用于其它的基于图像的序列识别问题,需要极少的领域知识。
139 preliminary [prɪˈlɪmɪnəri]
  • Compared with Capella Scan and PhotoScore, our CRNN-based system is still preliminary and misses many functionalities. 与Capella Scan和PhotoScore相比,我们的基于CRNN的系统仍然是初步的,并且缺少许多功能。
140 functionality [ˌfʌŋkʃəˈnæləti]
  • Compared with Capella Scan and PhotoScore, our CRNN-based system is still preliminary and misses many functionalities. 与Capella Scan和PhotoScore相比,我们的基于CRNN的系统仍然是初步的,并且缺少许多功能。

Words List (frequency)

# word (frequency) phonetic sentence
1 lexicon
(32)
[ˈleksɪkən]
  • (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks.(3)它不仅限于任何预定义的词汇,并且在无词典和基于词典的场景文本识别任务中都取得了显著的表现。
  • A lexicon is a set of label sequences that prediction is constraint to, e.g. a spell checking dictionary.词典是一组标签序列,预测受拼写检查字典约束。
  • In lexicon-free mode, predictions are made without any lexicon.在无词典模式中,预测时没有任何词典。
  • In lexicon-based mode, each test sample is associated with a lexicon ${\cal D}$.在基于词典的模式中，每个测试样本与词典${\cal D}$相关联。
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Eq. 1 for all sequences in the lexicon and choose the one with the highest probability.基本上，通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列，即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Eq. 1 for all sequences in the lexicon and choose the one with the highest probability.基本上，通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列，即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Eq. 1 for all sequences in the lexicon and choose the one with the highest probability.基本上，通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列，即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。
  • To solve this problem, we observe that the label sequences predicted via lexicon-free transcription, described in 2.3.2, are often close to the ground-truth under the edit distance metric.为了解决这个问题，我们观察到，2.3.2中描述的通过无词典转录预测的标签序列通常在编辑距离度量下接近于实际结果。
  • The search time complexity of BK-tree is $O(\log|{\cal D}|)$, where $|{\cal D}|$ is the lexicon size.BK树的搜索时间复杂度为$O(\log|{\cal D}|)$,其中$|{\cal D}|$是词典大小。
  • Therefore this scheme readily extends to very large lexicons.因此,这个方案很容易扩展到非常大的词典。
  • In our approach, a BK-tree is constructed offline for a lexicon.在我们的方法中,一个词典离线构造一个BK树。
  • Each test image is associated with a 50-words lexicon which is defined by Wang et al. [34].每张测试图像与由Wang等人[34]定义的50词的词典相关联。
  • A full lexicon is built by combining all the per-image lexicons.通过组合所有的每张图像词汇构建完整的词典。
  • A full lexicon is built by combining all the per-image lexicons.通过组合所有的每张图像词汇构建完整的词典。
  • In addition, we use a 50k words lexicon consisting of the words in the Hunspell spell-checking dictionary [1].此外,我们使用由Hunspell拼写检查字典[1]中的单词组成的5万个词的词典。
  • Each image has been associated to a 50-words lexicon and a 1k-words lexicon.每张图像关联一个50词的词典和一个1000词的词典。
  • Each image has been associated to a 50-words lexicon and a 1k-words lexicon.每张图像关联一个50词的词典和一个1000词的词典。
  • Each word image has a 50 words lexicon defined by Wang et al. [34].每张单词图像都有一个由Wang等人[34]定义的50个词的词典。
  • The average testing time is 0.16s/sample, as measured on IC03 without a lexicon.平均测试时间为0.16s/样本，在IC03上不使用词典测得。
  • The approximate lexicon search is applied to the 50k lexicon of IC03, with the parameter δ set to 3.近似词典搜索应用于IC03的50k词典,参数δ设置为3。
  • The approximate lexicon search is applied to the 50k lexicon of IC03, with the parameter δ set to 3.近似词典搜索应用于IC03的50k词典,参数δ设置为3。
  • In the second row, “50”, “1k”, “50k” and “Full” denote the lexicon used, and “None” denotes recognition without a lexicon.在第二行,“50”,“1k”,“50k”和“Full”表示使用的字典,“None”表示识别没有字典。
  • In the second row, “50”, “1k”, “50k” and “Full” denote the lexicon used, and “None” denotes recognition without a lexicon.在第二行,“50”,“1k”,“50k”和“Full”表示使用的字典,“None”表示识别没有字典。
  • In the constrained lexicon cases, our method consistently outperforms most state-of-the-art approaches, and on average beats the best text reader proposed in [22].在有约束词典的情况中，我们的方法始终优于大多数最新的方法，并且平均打败了[22]中提出的最佳文本阅读器。
  • Specifically, we obtain superior performance on IIIT5k, and SVT compared to [22], only achieved lower performance on IC03 with the “Full” lexicon.具体来说,与[22]相比,我们在IIIT5k和SVT上获得了卓越的性能,仅在IC03上通过“Full”词典实现了较低性能。
  • In the unconstrained lexicon cases, our method achieves the best performance on SVT, yet, is still behind some approaches [8, 22] on IC03 and IC13.在无约束词典的情况下,我们的方法在SVT上仍取得了最佳性能,但在IC03和IC13上仍然落后于一些方法[8,22]。
  • Note that the blanks in the “none” columns of Table 2 denote that such approaches are unable to be applied to recognition without a lexicon or did not report the recognition accuracies in the unconstrained cases.注意，表2的“none”列中的空白表示这种方法不能应用于没有词典的识别，或者没有报告无约束情况下的识别精度。
  • The best performance is reported by [22] in the unconstrained lexicon cases, benefiting from its large dictionary; however, it is not a model strictly unconstrained to a lexicon as mentioned before.[22]中报告的最佳性能是在无约束词典的情况下，受益于它的大字典；然而，它不是前面提到的严格的无约束词典模型。
  • The best performance is reported by [22] in the unconstrained lexicon cases, benefiting from its large dictionary; however, it is not a model strictly unconstrained to a lexicon as mentioned before.[22]中报告的最佳性能是在无约束词典的情况下，受益于它的大字典；然而，它不是前面提到的严格的无约束词典模型。
  • In this sense, our results in the unconstrained lexicon case are still promising.在这个意义上，我们在无约束词典情况下的结果仍然是有前途的。
  • Red bars: lexicon search time per sample.红条:每个样本的词典搜索时间。
  • Tested on the IC03 dataset with the 50k lexicon.在IC03数据集上使用50k词典进行的测试。
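Putting the pieces of this entry together, lexicon-based transcription as described is: decode without a lexicon, query the offline-built BK-tree for candidates within $\delta$, then rescore. A hedged end-to-end sketch (the `greedy_decode` and `ctc_prob` helpers are sketched under entries 3 and 16 below; `bk_tree` is a prebuilt `BKTree`; `alphabet` is an assumed id-to-character table with index 0 reserved for blank):

```python
def lexicon_transcribe(probs, bk_tree, alphabet, delta=3):
    """l* = argmax over N_delta(l') of p(l|y), per the scheme above."""
    l_prime = "".join(alphabet[i] for i in greedy_decode(probs))  # l'
    candidates = bk_tree.query(l_prime, delta)     # N_delta(l') via BK-tree
    if not candidates:
        return l_prime                             # fall back to l'
    char_to_id = {c: i for i, c in enumerate(alphabet)}
    return max(candidates,
               key=lambda l: ctc_prob([char_to_id[c] for c in l], probs))
```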
2 recurrent
(21)
[rɪˈkʌrənt]
  • Recurrent neural networks (RNN) models, another important branch of the deep neural networks family, were mainly designed for handling sequences.循环神经网络(RNN)模型是深度神经网络家族中的另一个重要分支,主要是设计来处理序列。
  • The proposed neural network model is named as Convolutional Recurrent Neural Network (CRNN), since it is a combination of DCNN and RNN.所提出的神经网络模型被称为卷积循环神经网络(CRNN),因为它是DCNN和RNN的组合。
  • The network architecture of CRNN, as shown in Fig. 1, consists of three components, including the convolutional layers, the recurrent layers, and a transcription layer, from bottom to top.如图1所示,CRNN的网络架构由三部分组成,包括卷积层,循环层和转录层,从底向上。
  • 2) recurrent layers, which predict a label distribution for each frame; 3) transcription layer, which translates the per-frame predictions into the final label sequence.2) 循环层,预测每一帧的标签分布;3) 转录层,将每一帧的预测变为最终的标签序列。
  • On top of the convolutional network, a recurrent network is built for making prediction for each frame of the feature sequence, outputted by the convolutional layers.在卷积网络之上,构建了一个循环网络,用于对卷积层输出的特征序列的每一帧进行预测。
  • The transcription layer at the top of CRNN is adopted to translate the per-frame predictions by the recurrent layers into a label sequence.采用CRNN顶部的转录层将循环层的每帧预测转化为标签序列。
  • Then a sequence of feature vectors is extracted from the feature maps produced by the component of convolutional layers, which is the input for the recurrent layers.然后从卷积层组件产生的特征图中提取特征向量序列,这些特征向量序列作为循环层的输入。
  • A deep bidirectional Recurrent Neural Network is built on the top of the convolutional layers, as the recurrent layers.一个深度双向循环神经网络是建立在卷积层的顶部,作为循环层。
  • A deep bidirectional Recurrent Neural Network is built on the top of the convolutional layers, as the recurrent layers.一个深度双向循环神经网络是建立在卷积层的顶部,作为循环层。
  • The recurrent layers predict a label distribution $y_t$ for each frame $x_t$ in the feature sequence $x = x_1,…,x_T$.循环层预测特征序列$x = x_1,…,x_T$中每一帧$x_t$的标签分布$y_t$。
  • The advantages of the recurrent layers are three-fold.循环层的优点是三重的。
  • Secondly, RNN can back-propagate error differentials to its input, i.e. the convolutional layer, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network.其次，RNN可以将误差差值反向传播到其输入，即卷积层，从而允许我们在统一的网络中共同训练循环层和卷积层。
  • In recurrent layers, error differentials are propagated in the opposite directions of the arrows shown in Fig. 3.b, i.e. Back-Propagation Through Time (BPTT).在循环层中，误差在图3.b所示箭头的相反方向传播，即随时间反向传播（BPTT）。
  • At the bottom of the recurrent layers, the sequence of propagated differentials are concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers.在循环层的底部,传播差异的序列被连接成映射,将特征映射转换为特征序列的操作进行反转并反馈到卷积层。
  • In practice, we create a custom network layer, called “Map-to-Sequence”, as the bridge between convolutional layers and recurrent layers.实际上,我们创建一个称为“Map-to-Sequence”的自定义网络层,作为卷积层和循环层之间的桥梁。
  • where $\mathbf{y}_{i}$ is the sequence produced by the recurrent and convolutional layers from $I_{i}$.$\mathbf{y}_{i}$是循环层和卷积层从$I_{i}$生成的序列。
  • In the recurrent layers, the Back-Propagation Through Time (BPTT) is applied to calculate the error differentials.在循环层中,应用随时间反向传播(BPTT)来计算误差。
  • The network not only has deep convolutional layers, but also has recurrent layers.网络不仅有深度卷积层,而且还有循环层。
  • Besides, recurrent layers in CRNN can utilize contextual information in the score.此外,CRNN中的循环层可以利用乐谱中的上下文信息。
  • In this paper, we have presented a novel neural network architecture, called Convolutional Recurrent Neural Network (CRNN), which integrates the advantages of both Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).在本文中,我们提出了一种新颖的神经网络架构,称为卷积循环神经网络(CRNN),其集成了卷积神经网络(CNN)和循环神经网络(RNN)的优点。
  • In this paper, we have presented a novel neural network architecture, called Convolutional Recurrent Neural Network (CRNN), which integrates the advantages of both Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).在本文中,我们提出了一种新颖的神经网络架构,称为卷积循环神经网络(CRNN),其集成了卷积神经网络(CNN)和循环神经网络(RNN)的优点。
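The "Map-to-Sequence" bridge mentioned above is essentially a reshape: each column of the convolutional feature maps becomes one frame of the sequence fed to the recurrent layers. A shape-only sketch (the 512×1×26 output shape is an assumption for illustration):

```python
import numpy as np

def map_to_sequence(feature_maps: np.ndarray) -> np.ndarray:
    """(channels, height, width) maps -> sequence of W frames of size C*H."""
    c, h, w = feature_maps.shape
    return feature_maps.transpose(2, 0, 1).reshape(w, c * h)

seq = map_to_sequence(np.zeros((512, 1, 26)))
print(seq.shape)   # (26, 512): one frame per feature-map column, left to right
```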
3 transcription
(15)
[trænˈskrɪpʃn]
  • A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed.提出了一种将特征提取,序列建模和转录整合到统一框架中的新型神经网络架构。
  • The network architecture of CRNN, as shown in Fig. 1, consists of three components, including the convolutional layers, the recurrent layers, and a transcription layer, from bottom to top.如图1所示,CRNN的网络架构由三部分组成,包括卷积层,循环层和转录层,从底向上。
  • 2) recurrent layers, which predict a label distribution for each frame; 3) transcription layer, which translates the per-frame predictions into the final label sequence.2) 循环层,预测每一帧的标签分布;3) 转录层,将每一帧的预测变为最终的标签序列。
  • The transcription layer at the top of CRNN is adopted to translate the per-frame predictions by the recurrent layers into a label sequence.采用CRNN顶部的转录层将循环层的每帧预测转化为标签序列。
  • 2.3. Transcription2.3. 转录
  • Transcription is the process of converting the per-frame predictions made by RNN into a label sequence.转录是将RNN所做的每帧预测转换成标签序列的过程。
  • Mathematically, transcription is to find the label sequence with the highest probability conditioned on the per-frame predictions.数学上,转录是根据每帧预测找到具有最高概率的标签序列。
  • In practice, there exist two modes of transcription, namely the lexicon-free and lexicon-based transcriptions.在实践中，存在两种转录模式，即无词典转录和基于词典的转录。
  • In practice, there exist two modes of transcription, namely the lexicon-free and lexicon-based transcriptions.在实践中，存在两种转录模式，即无词典转录和基于词典的转录。
  • 2.3.2 Lexicon-free transcription2.3.2 无字典转录
  • 2.3.3 Lexicon-based transcription2.3.3 基于词典的转录
  • 1 for all sequences in the lexicon and choose the one with the highest probability. To solve this problem, we observe that the label sequences predicted via lexicon-free transcription, described in 2.3.2, are often close to the ground-truth under the edit distance metric.然而,对于大型词典,例如5万个词的Hunspell拼写检查词典[1],对词典进行详尽的搜索是非常耗时的,即对词典中的所有序列计算方程1,并选择概率最高的一个。为了解决这个问题,我们观察到,2.3.2中描述的通过无词典转录预测的标签序列通常在编辑距离度量下接近于实际结果。
  • In particular, in the transcription layer, error differentials are back-propagated with the forward-backward algorithm, as described in [15].特别地，在转录层中，如[15]所述，误差使用前向-后向算法进行反向传播。
  • We implement the network within the Torch7 [10] framework, with custom implementations for the LSTM units (in Torch7/CUDA), the transcription layer (in C++) and the BK-tree data structure (in C++).我们在Torch7[10]框架内实现了网络,使用定制实现的LSTM单元(Torch7/CUDA),转录层(C++)和BK树数据结构(C++)。
  • Larger $\delta$ results in more candidates, thus more accurate lexicon-based transcription.更大的$\delta$导致更多的候选目标，从而基于词典的转录更准确。
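Lexicon-free transcription, as described in this entry, is the mapping ${\cal B}$ applied to the per-frame argmax path: collapse repeated labels, then drop blanks. A minimal sketch (blank assumed at index 0):

```python
import numpy as np

BLANK = 0

def greedy_decode(probs: np.ndarray):
    """Best-path decoding: l* ~ B(argmax_pi p(pi|y)) for a (T, |L'|) array."""
    out, prev = [], BLANK
    for p in probs.argmax(axis=1):       # most probable label at each frame
        if p != BLANK and p != prev:     # collapse repeats, remove blanks
            out.append(int(p))
        prev = p
    return out

# e.g. a per-frame argmax path [1,1,0,1,2,2] maps to the label sequence [1,1,2]
```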
4 e.g.
(9)
[ˌi: ˈdʒi:]
  • Some attempts have been made to address this problem for a specific sequence-like object (e.g. scene text).已经针对特定的类似序列的对象(例如场景文本)进行了一些尝试来解决该问题。
  • Besides, some ambiguous characters are easier to distinguish when observing their contexts, e.g. it is easier to recognize “il” by contrasting the character heights than by recognizing each of them separately.此外,一些模糊的字符在观察其上下文时更容易区分,例如,通过对比字符高度更容易识别“il”而不是分别识别它们中的每一个。
  • A lexicon is a set of label sequences that prediction is constraint to, e.g. a spell checking dictionary.词典是一组标签序列,预测受拼写检查字典约束。
  • Here, each $y_t \in\Re^{|{\cal L}’|}$ is a probability distribution over the set ${\cal L}’ = {\cal L} \cup \{\mathrm{blank}\}$, where ${\cal L}$ contains all labels in the task (e.g. all English characters), as well as a ’blank’ label denoted by -.这里，每个$y_t \in\Re^{|{\cal L}’|}$是在集合${\cal L}’ = {\cal L} \cup \{\mathrm{blank}\}$上的概率分布，其中${\cal L}$包含了任务中的所有标签（例如，所有英文字符），以及由-表示的“空白”标签。
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Eq. 1 for all sequences in the lexicon and choose the one with the highest probability.基本上，通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列，即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。
  • Unlike [22], CRNN is not limited to recognize a word in a known dictionary, and able to handle random strings (e.g. telephone numbers), sentences or other scripts like Chinese words.与[22]不同,CRNN不限于识别已知字典中的单词,并且能够处理随机字符串(例如电话号码),句子或其他诸如中文单词的脚本。
  • Consequently, some notes can be recognized by comparing them with the nearby notes, e.g. contrasting their vertical positions.因此,通过将一些音符与附近的音符进行比较可以识别它们,例如对比他们的垂直位置。
  • It directly runs on coarse level labels (e.g. words), requiring no detailed annotations for each individual element (e.g. characters) in the training phase.它直接在粗粒度的标签(例如单词)上运行,在训练阶段不需要详细标注每一个单独的元素(例如字符)。
  • It directly runs on coarse level labels (e.g. words), requiring no detailed annotations for each individual element (e.g. characters) in the training phase.它直接在粗粒度的标签(例如单词)上运行,在训练阶段不需要详细标注每一个单独的元素(例如字符)。
5 unconstrained
(9)
[ˌʌnkən'streɪnd]
  • 3) It has the same property of RNN, being able to produce a sequence of labels; 4) It is unconstrained to the lengths of sequence-like objects, requiring only height normalization in both training and testing phases;3)具有与RNN相同的性质,能够产生一系列标签;4)对类序列对象的长度无约束,只需要在训练阶段和测试阶段对高度进行归一化;
  • In the unconstrained lexicon cases, our method achieves the best performance on SVT, yet, is still behind some approaches [8, 22] on IC03 and IC13.在无约束词典的情况下,我们的方法在SVT上仍取得了最佳性能,但在IC03和IC13上仍然落后于一些方法[8,22]。
  • Note that the blanks in the “none” columns of Table 2 denote that such approaches are unable to be applied to recognition without a lexicon or did not report the recognition accuracies in the unconstrained cases.注意，表2的“none”列中的空白表示这种方法不能应用于没有词典的识别，或者没有报告无约束情况下的识别精度。
  • The best performance is reported by [22] in the unconstrained lexicon cases, benefiting from its large dictionary; however, it is not a model strictly unconstrained to a lexicon as mentioned before.[22]中报告的最佳性能是在无约束词典的情况下，受益于它的大字典；然而，它不是前面提到的严格的无约束词典模型。
  • The best performance is reported by [22] in the unconstrained lexicon cases, benefiting from its large dictionary; however, it is not a model strictly unconstrained to a lexicon as mentioned before.[22]中报告的最佳性能是在无约束词典的情况下，受益于它的大字典；然而，它不是前面提到的严格的无约束词典模型。
  • In this sense, our results in the unconstrained lexicon case are still promising.在这个意义上，我们在无约束词典情况下的结果仍然是有前途的。
  • For further understanding the advantages of the proposed algorithm over other text recognition approaches, we provide a comprehensive comparison on several properties named E2E Train, Conv Ftrs, CharGT-Free, Unconstrained, and Model Size, as summarized in Table 3.为了进一步了解与其它文本识别方法相比,所提出算法的优点,我们提供了在一些特性上的综合比较,这些特性名称为E2E Train,Conv Ftrs,CharGT-Free,Unconstrained和Model Size,如表3所示。
  • Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions).比较的属性包括:1)端到端训练(E2E Train);2)从图像中直接学习卷积特征而不是使用手动设计的特征(Conv Ftrs);3)训练期间不需要字符的实际边界框(CharGT-Free);4)不受限于预定义字典(Unconstrained);5)模型大小(如果使用端到端模型),通过模型参数数量来衡量(Model Size, M表示百万)。
  • Unconstrained: This column is to indicate whether the trained model is constrained to a specific dictionary, unable to handle out-of-dictionary words or random sequences.Unconstrained：这一列用来表明训练模型是否受限于一个特定的字典，是否不能处理字典之外的单词或随机序列。
6 DCNN
(8)
[!≈ di: si: en en]
  • Recently, the community has seen a strong revival of neural networks, which is mainly stimulated by the great success of deep neural network models, specifically Deep Convolutional Neural Networks (DCNN), in various vision tasks.最近,社区已经看到神经网络的强大复兴,这主要受到深度神经网络模型,特别是深度卷积神经网络(DCNN)在各种视觉任务中的巨大成功的推动。
  • Consequently, the most popular deep models like DCNN [25, 26] cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions, and thus are incapable of producing a variable-length label sequence.因此,最流行的深度模型像DCNN[25,26]不能直接应用于序列预测,因为DCNN模型通常对具有固定维度的输入和输出进行操作,因此不能产生可变长度的标签序列。
  • Consequently, the most popular deep models like DCNN [25, 26] cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions, and thus are incapable of producing a variable-length label sequence.因此,最流行的深度模型像DCNN[25,26]不能直接应用于序列预测,因为DCNN模型通常对具有固定维度的输入和输出进行操作,因此不能产生可变长度的标签序列。
  • For example, the algorithms in [35, 8] firstly detect individual characters and then recognize these detected characters with DCNN models, which are trained using labeled character images.例如,[35,8]中的算法首先检测单个字符,然后用DCNN模型识别这些检测到的字符,并使用标注的字符图像进行训练。
  • In summary, current systems based on DCNN can not be directly used for image-based sequence recognition.总之,目前基于DCNN的系统不能直接用于基于图像的序列识别。
  • The proposed neural network model is named as Convolutional Recurrent Neural Network (CRNN), since it is a combination of DCNN and RNN.所提出的神经网络模型被称为卷积循环神经网络(CRNN),因为它是DCNN和RNN的组合。
  • 2) It has the same property of DCNN on learning informative representations directly from image data, requiring neither hand-craft features nor preprocessing steps, including binarization/segmentation, component localization, etc.;2)直接从图像数据学习信息表示时具有与DCNN相同的性质,既不需要手工特征也不需要预处理步骤,包括二值化/分割,组件定位等;
  • 5) It achieves better or highly competitive performance on scene texts (word recognition) than the prior arts [23, 8]; 6) It contains much less parameters than a standard DCNN model, consuming less storage space.5)与现有技术相比,它在场景文本(字识别)上获得更好或更具竞争力的表现[23,8]。6)它比标准DCNN模型包含的参数要少得多,占用更少的存储空间。
7 IC03
(8)
  • Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT).有四个流行的基准数据集用于场景文本识别的性能评估,即ICDAR 2003(IC03),ICDAR 2013(IC13),IIIT 5k-word(IIIT5k)和Street View Text (SVT)。
  • IC03 [27] test dataset contains 251 scene images with labeled text bounding boxes.IC03[27]测试数据集包含251个具有标记文本边界框的场景图像。
  • IC13 [24] test dataset inherits most of its data from IC03.IC13[24]测试数据集继承了IC03中的大部分数据。
  • The average testing time is 0.16s/sample, as measured on IC03 without a lexicon.平均测试时间为0.16s/样本,在IC03上测得的,没有词典。
  • The approximate lexicon search is applied to the 50k lexicon of IC03, with the parameter δ set to 3.近似词典搜索应用于IC03的50k词典,参数δ设置为3。
  • Specifically, we obtain superior performance on IIIT5k, and SVT compared to [22], only achieved lower performance on IC03 with the “Full” lexicon.具体来说,与[22]相比,我们在IIIT5k和SVT上获得了卓越的性能,仅在IC03上通过“Full”词典实现了较低性能。
  • In the unconstrained lexicon cases, our method achieves the best performance on SVT, yet, is still behind some approaches [8, 22] on IC03 and IC13.在无约束词典的情况下,我们的方法在SVT上仍取得了最佳性能,但在IC03和IC13上仍然落后于一些方法[8,22]。
  • Tested on the IC03 dataset with the 50k lexicon.在IC03数据集上使用50k词典进行的测试。
8 real-world
(7)
[!≈ ˈri:əl wɜ:ld]
  • (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios.(4)它产生了一个有效而小得多的模型,这对于现实世界的应用场景更为实用。
  • Our network is trained on the synthetic data once, and tested on all other real-world test datasets without any fine-tuning on their training data.我们的网络在合成数据上进行了一次训练,并在所有其它现实世界的测试数据集上进行了测试,而没有在其训练数据上进行任何微调。
  • It contains 200 samples, some of which are shown in Fig. 5.b; 3) “Real-World”, which contains 200 images of score fragments taken from music books with a phone camera.它包含200个样本，其中一些如图5.b所示；3)“现实世界”，其中包含用手机相机拍摄的音乐书籍中的200张图像。
  • (c) Real-world score images taken with a mobile phone camera.(c)用手机相机拍摄的现实世界的乐谱图像。
  • The CRNN outperforms the two commercial systems by a large margin. The Capella Scan and PhotoScore systems perform reasonably well on the Clean dataset, but their performances drop significantly on synthesized and real-world data.Capella Scan和PhotoScore系统在干净的数据集上表现相当不错,但是它们的性能在合成和现实世界数据方面显著下降。
  • The main reason is that they rely on robust binarization to detect staff lines and notes, but the binarization step often fails on synthesized and real-world data due to bad lighting condition, noise corruption and cluttered background.主要原因是它们依赖于鲁棒的二值化来检测五线谱和音符，但是由于光线不良，噪音破坏和杂乱的背景，二值化步骤经常会在合成数据和现实数据上失败。
  • To further speed up CRNN and make it more practical in real-world applications is another direction that is worthy of exploration in the future.进一步加快CRNN,使其在现实应用中更加实用,是未来值得探索的另一个方向。
9 i.e.
(7)
[ˌaɪ ˈi:]
  • Secondly, RNN can back-propagate error differentials to its input, i.e. the convolutional layer, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network.其次，RNN可以将误差差值反向传播到其输入，即卷积层，从而允许我们在统一的网络中共同训练循环层和卷积层。
  • In recurrent layers, error differentials are propagated in the opposite directions of the arrows shown in Fig. 3.b, i.e. Back-Propagation Through Time (BPTT).在循环层中，误差在图3.b所示箭头的相反方向传播，即随时间反向传播（BPTT）。
  • The sequence $\mathbf{l}^{*}$ is approximately found by $\mathbf{l}^{*}\approx{\cal B}(\arg\max_{\boldsymbol{\pi}}p(\boldsymbol{\pi}|\mathbf{y}))$, i.e. taking the most probable label $\pi_{t}$ at each time stamp $t$, and mapping the resulting sequence onto $\mathbf{l}^{*}$.序列$\mathbf{l}^{*}$通过$\mathbf{l}^{*}\approx{\cal B}(\arg\max_{\boldsymbol{\pi}}p(\boldsymbol{\pi}|\mathbf{y}))$近似得到，即在每个时间戳$t$采用最大概率的标签$\pi_{t}$，并将结果序列映射到$\mathbf{l}^{*}$。
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Eq. 1 for all sequences in the lexicon and choose the one with the highest probability.基本上，通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列，即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Eq. 1 for all sequences in the lexicon and choose the one with the highest probability.基本上，通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列，即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而，对于大型词典，例如5万个词的Hunspell拼写检查词典[1]，对词典进行详尽的搜索是非常耗时的，即对词典中的所有序列计算方程1，并选择概率最高的一个。
  • Two measures are used for evaluating the recognition performance: 1) fragment accuracy, i.e. the percentage of score fragments correctly recognized; 2) average edit distance, i.e. the average edit distance between predicted pitch sequences and the ground truths.使用两种方法来评估识别性能:1)片段准确度,即正确识别的乐谱片段的百分比;2)平均编辑距离,即预测音调序列与真实值之间的平均编辑距离。
  • Two measures are used for evaluating the recognition performance: 1) fragment accuracy, i.e. the percentage of score fragments correctly recognized; 2) average edit distance, i.e. the average edit distance between predicted pitch sequences and the ground truths.使用两种方法来评估识别性能:1)片段准确度,即正确识别的乐谱片段的百分比;2)平均编辑距离,即预测音调序列与真实值之间的平均编辑距离。
10 trainable
(6)
[t'reɪnəbl]
  • An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition基于图像序列识别的端到端可训练神经网络及其在场景文本识别中的应用
  • Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned.与以前的场景文本识别系统相比,所提出的架构具有四个不同的特性:(1)与大多数现有的组件需要单独训练和协调的算法相比,它是端对端训练的。
  • Being robust, rich and trainable, deep convolutional features have been widely adopted for different kinds of visual recognition tasks [25, 12].鲁棒的,丰富的和可训练的深度卷积特征已被广泛应用于各种视觉识别任务[25,12]。
  • Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions).比较的属性包括:1)端到端训练(E2E Train);2)从图像中直接学习卷积特征而不是使用手动设计的特征(Conv Ftrs);3)训练期间不需要字符的实际边界框(CharGT-Free);4)不受限于预定义字典(Unconstrained);5)模型大小(如果使用端到端模型),通过模型参数数量来衡量(Model Size, M表示百万)。
  • Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions).比较的属性包括:1)端到端训练(E2E Train);2)从图像中直接学习卷积特征而不是使用手动设计的特征(Conv Ftrs);3)训练期间不需要字符的实际边界框(CharGT-Free);4)不受限于预定义字典(Unconstrained);5)模型大小(如果使用端到端模型),通过模型参数数量来衡量(Model Size, M表示百万)。
  • E2E Train: This column is to show whether a certain text reading model is end-to-end trainable, without any preprocess or through several separated steps, which indicates such approaches are elegant and clean for training.E2E Train:这一列是为了显示某种文字阅读模型是否可以进行端到端的训练,无需任何预处理或经过几个分离的步骤,这表明这种方法对于训练是优雅且干净的。
11 lexicon-based
(6)
[!≈ ˈleksɪkən beɪst]
  • (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks.(3)它不仅限于任何预定义的词汇,并且在无词典和基于词典的场景文本识别任务中都取得了显著的表现。
  • In practice, there exist two modes of transcription, namely the lexicon-free and lexicon-based transcriptions.在实践中，存在两种转录模式，即无词典转录和基于词典的转录。
  • In lexicon-based mode, predictions are made by choosing the label sequence that has the highest probability.在基于词典的模式中,通过选择具有最高概率的标签序列进行预测。
  • 2.3.3 Lexicon-based transcription2.3.3 基于词典的转录
  • In lexicon-based mode, each test sample is associated with a lexicon ${\cal D}$.在基于词典的模式中，每个测试样本与词典${\cal D}$相关联。
  • Larger $\delta$ results in more candidates, thus more accurate lexicon-based transcription.更大的$\delta$导致更多的候选目标，从而基于词典的转录更准确。
12 OMR
(6)
[!≈ əu em ɑ:(r)]
  • Recognizing musical scores in images is known as the Optical Music Recognition (OMR) problem.识别图像中的乐谱被称为光学音乐识别(OMR)问题。
  • We cast the OMR as a sequence recognition problem, and predict a sequence of musical notes directly from the image with CRNN.我们将OMR作为序列识别问题,直接用CRNN从图像中预测音符的序列。
  • For comparison, we evaluate two commercial OMR engines, namely the Capella Scan [3] and the PhotoScore [4].为了比较,我们评估了两种商用OMR引擎,即Capella Scan[3]和PhotoScore[4]。
  • Comparison of pitch recognition accuracies, among CRNN and two commercial OMR systems, on the three datasets we have collected.在我们收集的数据集上,CRNN和两个商业OMR系统对音调识别准确率的对比。
  • But it provides a new scheme for OMR, and has shown promising capabilities in pitch recognition.但它为OMR提供了一个新的方案,并且在音高识别方面表现出有前途的能力。
  • In addition, CRNN significantly outperforms other competitors on a benchmark for Optical Music Recognition (OMR), which verifies the generality of CRNN.此外,CRNN在光学音乐识别(OMR)的基准数据集上显著优于其它的竞争者,这验证了CRNN的泛化性。
13 annotation
(5)
[ˌænə'teɪʃn]
  • For sequence-like objects, CRNN possesses several distinctive advantages over conventional neural network models: 1) It can be directly learned from sequence labels (for instance, words), requiring no detailed annotations (for instance, characters);对于类序列对象,CRNN与传统神经网络模型相比具有一些独特的优点:1)可以直接从序列标签(例如单词)学习,不需要详细的标注(例如字符);
  • Our method uses only synthetic text with word level labels as the training data, very different to PhotoOCR [8] which used 7.9 millions of real word images with character-level annotations for training.我们的方法只使用具有单词级标签的合成文本作为训练数据,与PhotoOCR[8]非常不同,后者使用790万个具有字符级标注的真实单词图像进行训练。
  • CharGT-Free: This column is to indicate whether the character-level annotations are essential for training the model.CharGT-Free:这一列用来表明字符级标注对于训练模型是否是必要的。
  • As the input and output labels of CRNN can be a sequence, character-level annotations are not necessary.由于CRNN的输入和输出标签是序列,因此字符级标注是不必要的。
  • It directly runs on coarse level labels (e.g. words), requiring no detailed annotations for each individual element (e.g. characters) in the training phase.它直接在粗粒度的标签(例如单词)上运行,在训练阶段不需要详细标注每一个单独的元素(例如字符)。
14 receptive
(5)
[rɪˈseptɪv]
  • Therefore, each column of the feature maps corresponds to a rectangle region of the original image (termed the receptive field), and such rectangle regions are in the same order to their corresponding columns on the feature maps from left to right.因此,特征图的每列对应于原始图像的一个矩形区域(称为感受野),并且这些矩形区域与特征图上从左到右的相应列具有相同的顺序。
  • As illustrated in Fig. 2, each vector in the feature sequence is associated with a receptive field, and can be considered as the image descriptor for that region.如图2所示,特征序列中的每个向量关联一个感受野,并且可以被认为是该区域的图像描述符。
  • The receptive field.感受野。
  • Each vector in the extracted feature sequence is associated with a receptive field on the input image, and can be considered as the feature vector of that field.提取的特征序列中的每一个向量关联输入图像的一个感受野,可认为是该区域的特征向量。
  • On top of that, the rectangular pooling windows yield rectangular receptive fields (illustrated in Fig. 2), which are beneficial for recognizing some characters that have narrow shapes, such as ’i’ and ’l’.最重要的是,矩形池窗口产生矩形感受野(如图2所示),这有助于识别一些具有窄形状的字符,例如i和l。
15 differential
(5)
[ˌdɪfəˈrenʃl]
  • Secondly, RNN can back-propagate error differentials to its input, i.e. the convolutional layer, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network.其次，RNN可以将误差差值反向传播到其输入，即卷积层，从而允许我们在统一的网络中共同训练循环层和卷积层。
  • In recurrent layers, error differentials are propagated in the opposite directions of the arrows shown in Fig. 3.b, i.e. Back-Propagation Through Time (BPTT).在循环层中，误差在图3.b所示箭头的相反方向传播，即随时间反向传播（BPTT）。
  • At the bottom of the recurrent layers, the sequence of propagated differentials are concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers.在循环层的底部,传播差异的序列被连接成映射,将特征映射转换为特征序列的操作进行反转并反馈到卷积层。
  • In particular, in the transcription layer, error differentials are back-propagated with the forward-backward algorithm, as described in [15].特别地,在转录层中,如[15]所述,误差使用前向-后向算法进行反向传播。
  • In the recurrent layers, the Back-Propagation Through Time (BPTT) is applied to calculate the error differentials.在循环层中,应用随时间反向传播(BPTT)来计算误差。
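A minimal sketch of the second point, using PyTorch autograd as a stand-in (an assumption; the paper's implementation is Torch7/CUDA): after the backward pass, the error differentials arrive at the RNN's input tensor, which is what allows joint training with the convolutional layers below it:

```python
import torch
import torch.nn as nn

# Toy feature sequence standing in for the convolutional output:
# T time steps, batch of 1, 32-dimensional frames.
T, N, C = 10, 1, 32
feats = torch.randn(T, N, C, requires_grad=True)

rnn = nn.LSTM(input_size=C, hidden_size=16, num_layers=2, bidirectional=True)
out, _ = rnn(feats)

loss = out.sum()   # stand-in for the transcription objective
loss.backward()    # BPTT runs inside the LSTM ...

# ... and the error differentials arrive at the RNN's input, ready to be
# reshaped back into maps and fed to the convolutional layers.
print(feats.grad.shape)  # torch.Size([10, 1, 32])
```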
16 conditional
(5)
[kənˈdɪʃənl]
  • We adopt the conditional probability defined in the Connectionist Temporal Classification (CTC) layer proposed by Graves et al. [15].我们采用Graves等人[15]提出的联接时间分类(CTC)层中定义的条件概率。
  • The formulation of the conditional probability is briefly described as follows: The input is a sequence $y = y_1,…,y_T$ where T is the sequence length.条件概率的公式简要描述如下:输入是序列$y = y_1,…,y_T$,其中T是序列长度。
  • Then, the conditional probability is defined as the sum of probabilities of all $\boldsymbol{\pi}$ that are mapped by ${\cal B}$ onto $\mathbf{l}$:然后,条件概率被定义为由${\cal B}$映射到$\mathbf{l}$上的所有$\boldsymbol{\pi}$的概率之和:
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Eq.1.基本上,通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列,即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而,对于大型词典,例如5万词的Hunspell拼写检查字典[1],对整个词典进行穷举搜索,即计算方程1,将非常耗时。
  • The objective is to minimize the negative log-likelihood of conditional probability of ground truth:目标是最小化真实条件概率的负对数似然:
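A toy brute-force rendering of Eq. 1 (not from the paper): enumerate every per-frame label path $\boldsymbol{\pi}$, keep those that ${\cal B}$ maps onto $\mathbf{l}$, and sum their probabilities. This is feasible only because $T$ and the alphabet are tiny here, which is exactly why the text calls direct computation infeasible in general:

```python
import itertools
import numpy as np

BLANK = 0

def B(pi):
    """Collapse repeated labels, then drop blanks."""
    out, prev = [], None
    for p in pi:
        if p != prev and p != BLANK:
            out.append(p)
        prev = p
    return tuple(out)

def ctc_prob_bruteforce(y, label):
    """p(l|y) = sum over all paths pi with B(pi) = l of prod_t y[t, pi_t]."""
    T, C = y.shape
    total = 0.0
    for pi in itertools.product(range(C), repeat=T):
        if B(pi) == tuple(label):
            total += np.prod([y[t, pi[t]] for t in range(T)])
    return total

rng = np.random.default_rng(0)
y = rng.random((4, 3))             # T=4 frames, alphabet {blank, 1, 2}
y /= y.sum(axis=1, keepdims=True)  # per-frame probability distributions
print(ctc_prob_bruteforce(y, [1, 2]))
```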
17 BK-tree
(5)
  • The candidates ${\cal N}_{\delta}(\mathbf{l}’)$ can be found efficiently with the BK-tree data structure[9], which is a metric tree specifically adapted to discrete metric spaces.可以使用BK树数据结构[9]有效地找到候选目标${\cal N}_{\delta}(\mathbf{l}’)$,这是一种专门适用于离散度量空间的度量树。
  • The search time complexity of BK-tree is $O(\log|{\cal D}|)$, where $|{\cal D}|$ is the lexicon size.BK树的搜索时间复杂度为$O(\log|{\cal D}|)$,其中$|{\cal D}|$是词典大小。
  • In our approach, a BK-tree is constructed offline for a lexicon.在我们的方法中,一个词典离线构造一个BK树。
  • We implement the network within the Torch7 [10] framework, with custom implementations for the LSTM units (in Torch7/CUDA), the transcription layer (in C++) and the BK-tree data structure (in C++).我们在Torch7[10]框架内实现了网络,使用定制实现的LSTM单元(Torch7/CUDA),转录层(C++)和BK树数据结构(C++)。
  • On the other hand, the computational cost grows with larger $\delta$, due to longer BK-tree search time, as well as a larger number of candidate sequences for testing.另一方面,由于更长的BK树搜索时间,以及更多的候选序列用于测试,计算成本随着$\delta$的增大而增加。
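A minimal BK-tree sketch over Levenshtein distance (a generic illustration, not the paper's C++ implementation): each node stores one lexicon word, children are keyed by their distance to it, and the triangle inequality prunes whole subtrees during a query:

```python
def levenshtein(a, b):
    """Classic DP edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = [next(it), {}]          # node = [word, {distance: child}]
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = [word, {}]
                return

    def query(self, word, delta):
        """All lexicon entries within edit distance delta of `word`."""
        out, stack = [], [self.root]
        while stack:
            node = stack.pop()
            d = levenshtein(word, node[0])
            if d <= delta:
                out.append(node[0])
            # Triangle inequality: only children keyed in [d-delta, d+delta]
            # can contain matches, so the rest of the tree is skipped.
            for k, child in node[1].items():
                if d - delta <= k <= d + delta:
                    stack.append(child)
        return out

tree = BKTree(["hello", "help", "hull", "shell", "yellow"])
print(tree.query("helo", delta=2))   # ['hello', 'help', 'hull', 'shell']
```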
18 synthetic
(5)
[sɪnˈθetɪk]
  • For all the experiments for scene text recognition, we use the synthetic dataset (Synth) released by Jaderberg et al. [20] as the training data.对于场景文本识别的所有实验,我们使用Jaderberg等人[20]发布的合成数据集(Synth)作为训练数据。
  • Such images are generated by a synthetic text engine and are highly realistic.这样的图像由合成文本引擎生成,非常逼真。
  • Our network is trained on the synthetic data once, and tested on all other real-world test datasets without any fine-tuning on their training data.我们的网络在合成数据上进行了一次训练,并在所有其它现实世界的测试数据集上进行了测试,而没有在其训练数据上进行任何微调。
  • Even though the CRNN model is purely trained with synthetic text data, it works well on real images from standard text recognition benchmarks.即使CRNN模型是在纯合成文本数据上训练,但它在标准文本识别基准数据集的真实图像上工作良好。
  • Our method uses only synthetic text with word level labels as the training data, very different to PhotoOCR [8] which used 7.9 millions of real word images with character-level annotations for training.我们的方法只使用具有单词级标签的合成文本作为训练数据,与PhotoOCR[8]非常不同,后者使用790万个具有字符级标注的真实单词图像进行训练。
19 pitch
(5)
[pɪtʃ]
  • For simplicity, we recognize pitches only, ignoring all chords and assuming the same major scale (C major) for all scores.为了简单起见,我们仅识别音调,忽略所有和弦,并假定所有乐谱具有相同的大调音阶(C大调)。
  • To the best of our knowledge, no public datasets exist for evaluating algorithms on pitch recognition.据我们所知,没有用于评估音调识别算法的公共数据集。
  • Two measures are used for evaluating the recognition performance: 1) fragment accuracy, i.e. the percentage of score fragments correctly recognized; 2) average edit distance, i.e. the average edit distance between predicted pitch sequences and the ground truths.使用两种方法来评估识别性能:1)片段准确度,即正确识别的乐谱片段的百分比;2)平均编辑距离,即预测音调序列与真实值之间的平均编辑距离。
  • Comparison of pitch recognition accuracies, among CRNN and two commercial OMR systems, on the three datasets we have collected.在我们收集的数据集上,CRNN和两个商业OMR系统对音调识别准确率的对比。
  • But it provides a new scheme for OMR, and has shown promising capabilities in pitch recognition.但它为OMR提供了一个新的方案,并且在音高识别方面表现出有前途的能力。
20 generality
(4)
[ˌdʒenəˈræləti]
  • Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it.此外,提出的算法在基于图像的乐谱识别任务中表现良好,这显然证实了它的泛化性。
  • To further demonstrate the generality of CRNN, we verify the proposed algorithm on a music score recognition task in Sec. 3.4.为了进一步证明CRNN的泛化性,在3.4小节我们在乐谱识别任务上验证了提出的算法。
  • The results have shown the generality of CRNN, in that it can be readily applied to other image-based sequence recognition problems, requiring minimal domain knowledge.结果显示了CRNN的泛化性,因为它可以很容易地应用于其它的基于图像的序列识别问题,需要极少的领域知识。
  • In addition, CRNN significantly outperforms other competitors on a benchmark for Optical Music Recognition (OMR), which verifies the generality of CRNN.此外,CRNN在光学音乐识别(OMR)的基准数据集上显著优于其它的竞争者,这验证了CRNN的泛化性。
21 Eq
(4)
  • Directly computing Eq.1 would be computationally infeasible due to the exponentially large number of summation items.由于存在指数级数量的求和项,直接计算方程1在计算上是不可行的。
  • However, Eq.1 can be efficiently computed using the forward-backward algorithm described in [15].然而,使用[15]中描述的前向-后向算法可以有效计算方程1。
  • In this mode, the sequence $\mathbf{l}^{*}$ that has the highest probability as defined in Eq.1 is taken as the prediction.在这种模式下,将具有方程1中定义的最高概率的序列$\mathbf{l}^{*}$作为预测。
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Eq.1.基本上,通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列,即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而,对于大型词典,例如5万词的Hunspell拼写检查字典[1],对整个词典进行穷举搜索,即计算方程1,将非常耗时。
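A minimal sketch of the forward (α) pass of the forward-backward algorithm from [15], which computes Eq. 1 in $O(T\cdot|\mathbf{l}|)$ instead of enumerating paths; it works in raw probabilities for clarity, whereas practical implementations use log space for numerical stability:

```python
import numpy as np

BLANK = 0

def ctc_forward_prob(y, label):
    """p(l|y) via the alpha recursion over the blank-extended label l'."""
    T = y.shape[0]
    ext = [BLANK]                  # l' = blank, l1, blank, l2, blank, ...
    for c in label:
        ext += [c, BLANK]
    S = len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, ext[0]]
    if S > 1:
        alpha[0, 1] = y[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s >= 1:
                a += alpha[t - 1, s - 1]
            # Skipping the preceding blank is allowed unless it would
            # merge two identical labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * y[t, ext[s]]
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)

rng = np.random.default_rng(0)
y = rng.random((4, 3))
y /= y.sum(axis=1, keepdims=True)
print(ctc_forward_prob(y, [1, 2]))  # equals the brute-force path sum sketched earlier
```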
22 ADADELTA
(4)
[!≈ eɪ di: eɪ di: i: el ti: eɪ]
  • For optimization, we use the ADADELTA [37] to automatically calculate per-dimension learning rates.为了优化,我们使用ADADELTA[37]自动计算每维的学习率。
  • Compared with the conventional momentum [31] method, ADADELTA requires no manual setting of a learning rate.与传统的动量[31]方法相比,ADADELTA不需要手动设置学习率。
  • More importantly, we find that optimization using ADADELTA converges faster than the momentum method.更重要的是,我们发现使用ADADELTA的优化收敛速度比动量方法快。
  • Networks are trained with ADADELTA, setting the parameter ρ to 0.9.网络用ADADELTA训练,将参数ρ设置为0.9。
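A minimal sketch of the ADADELTA update following Zeiler [37]: two per-dimension running averages replace the manual learning rate, with ρ = 0.9 as in the text:

```python
import numpy as np

def adadelta_step(x, grad, state, rho=0.9, eps=1e-6):
    """One ADADELTA update; note there is no learning rate anywhere.

    state keeps two per-dimension running averages:
    Eg2 = E[g^2] and Edx2 = E[dx^2].
    """
    state["Eg2"] = rho * state["Eg2"] + (1 - rho) * grad ** 2
    dx = -np.sqrt(state["Edx2"] + eps) / np.sqrt(state["Eg2"] + eps) * grad
    state["Edx2"] = rho * state["Edx2"] + (1 - rho) * dx ** 2
    return x + dx

# Minimize f(x) = x1^2 + x2^2 from a fixed start; grad of f is 2x.
x = np.array([3.0, -2.0])
state = {"Eg2": np.zeros_like(x), "Edx2": np.zeros_like(x)}
for _ in range(5000):
    x = adadelta_step(x, 2 * x, state)
print(x)  # approaches the optimum [0, 0]
```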
23 SVT
(4)
[!≈ es vi: ti:]
  • Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT).有四个流行的基准数据集用于场景文本识别的性能评估,即ICDAR 2003(IC03),ICDAR 2013(IC13),IIIT 5k-word(IIIT5k)和Street View Text (SVT)。
  • SVT [34] test dataset consists of 249 street view images collected from Google Street View.SVT[34]测试数据集由从Google街景视图收集的249张街景图像组成。
  • Specifically, we obtain superior performance on IIIT5k and SVT compared to [22], only achieving lower performance on IC03 with the “Full” lexicon.具体来说,与[22]相比,我们在IIIT5k和SVT上获得了更优的性能,仅在IC03上使用“Full”词典时性能较低。
  • In the unconstrained lexicon cases, our method achieves the best performance on SVT, yet is still behind some approaches [8, 22] on IC03 and IC13.在无约束词典的情况下,我们的方法在SVT上取得了最佳性能,但在IC03和IC13上仍然落后于一些方法[8,22]。
24 synthesize
(4)
[ˈsɪnθəsaɪz]
  • Examples are shown in Fig. 5.a; 2) “Synthesized”, which is created from “Clean”, using the augmentation strategy mentioned above.实例如图5.a所示;2)“合成的”,由“纯净的”创建,使用了上述的增强策略。
  • Figure 5. (a) Clean musical score images collected from [2] (b) Synthesized musical score images.图5。(a)从[2]中收集的干净的乐谱图像。(b)合成的乐谱图像。
  • The CRNN outperforms the two commercial systems by a large margin. The Capella Scan and PhotoScore systems perform reasonably well on the Clean dataset, but their performances drop significantly on synthesized and real-world data.CRNN大幅优于这两个商业系统。Capella Scan和PhotoScore系统在干净的数据集上表现相当不错,但是它们的性能在合成数据和现实世界数据上显著下降。
  • The main reason is that they rely on robust binarization to detect staff lines and notes, but the binarization step often fails on synthesized and real-world data due to bad lighting condition, noise corruption and cluttered background.主要原因是它们依赖于鲁棒的二值化来检测五线谱和音符,但是由于光线不良,噪音破坏和杂乱的背景,二值化步骤经常会在合成数据和现实数据上失败。
25 ICDAR
(3)
[!≈ aɪ si: di: eɪ ɑ:(r)]
  • The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts.在包括IIIT-5K,Street View Text和ICDAR数据集在内的标准基准数据集上的实验证明了提出的算法比现有技术的更有优势。
  • Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT).有四个流行的基准数据集用于场景文本识别的性能评估,即ICDAR 2003(IC03),ICDAR 2013(IC13),IIIT 5k-word(IIIT5k)和Street View Text (SVT)。
  • Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT).有四个流行的基准数据集用于场景文本识别的性能评估,即ICDAR 2003(IC03),ICDAR 2013(IC13),IIIT 5k-word(IIIT5k)和Street View Text (SVT)。
26 sequential
(3)
[sɪˈkwenʃl]
  • For example, Graves et al. [16] extract a set of geometrical or image features from handwritten texts, while Su and Lu [33] convert word images into sequential HOG features.例如,Graves等[16]从手写文本中提取一系列几何或图像特征,而Su和Lu[33]将单词图像转换为序列HOG特征。
  • Such component is used to extract a sequential feature representation from an input image.这样的组件用于从输入图像中提取序列特征表示。
  • In CRNN, we convey deep features into sequential representations in order to be invariant to the length variation of sequence-like objects.在CRNN中,我们将深度特征传递到序列表示中,以便对类序列对象的长度变化保持不变。
27 binarization
(3)
  • 2) It has the same property of DCNN on learning informative representations directly from image data, requiring neither hand-craft features nor preprocessing steps, including binarization/segmentation, component localization, etc.;2)直接从图像数据学习信息表示时具有与DCNN相同的性质,既不需要手工特征也不需要预处理步骤,包括二值化/分割,组件定位等;
  • The main reason is that they rely on robust binarization to detect staff lines and notes, but the binarization step often fails on synthesized and real-world data due to bad lighting condition, noise corruption and cluttered background.主要原因是它们依赖于鲁棒的二值化来检测五线谱和音符,但是由于光线不良,噪音破坏和杂乱的背景,二值化步骤经常会在合成数据和现实数据上失败。
  • The main reason is that they rely on robust binarization to detect staff lines and notes, but the binarization step often fails on synthesized and real-world data due to bad lighting condition, noise corruption and cluttered background.主要原因是它们依赖于鲁棒的二值化来检测五线谱和音符,但是由于光线不良,噪音破坏和杂乱的背景,二值化步骤经常会在合成数据和现实数据上失败。
28 contextual
(3)
[kənˈtekstʃuəl]
  • Firstly, RNN has a strong capability of capturing contextual information within a sequence.首先,RNN具有很强的捕获序列内上下文信息的能力。
  • Using contextual cues for image-based sequence recognition is more stable and helpful than treating each symbol independently.对于基于图像的序列识别使用上下文提示比独立处理每个符号更稳定且更有帮助。
  • Besides, recurrent layers in CRNN can utilize contextual information in the score.此外,CRNN中的循环层可以利用乐谱中的上下文信息。
29 k-word
(3)
[!≈ keɪ wɜ:d]
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Eq.1.基本上,通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列,即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而,对于大型词典,例如5万词的Hunspell拼写检查字典[1],对整个词典进行穷举搜索,即计算方程1,将非常耗时。
  • Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT).有四个流行的基准数据集用于场景文本识别的性能评估,即ICDAR 2003(IC03),ICDAR 2013(IC13),IIIT 5k-word(IIIT5k)和Street View Text (SVT)。
  • Each image has been associated to a 50-words lexicon and a 1k-words lexicon.每张图像关联一个50词的词典和一个1000词的词典。
30 IC13
(3)
  • Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT).有四个流行的基准数据集用于场景文本识别的性能评估,即ICDAR 2003(IC03),ICDAR 2013(IC13),IIIT 5k-word(IIIT5k)和Street View Text (SVT)。
  • IC13 [24] test dataset inherits most of its data from IC03.IC13[24]测试数据集继承了IC03中的大部分数据。
  • In the unconstrained lexicon cases, our method achieves the best performance on SVT, yet is still behind some approaches [8, 22] on IC03 and IC13.在无约束词典的情况下,我们的方法在SVT上取得了最佳性能,但在IC03和IC13上仍然落后于一些方法[8,22]。
31 IIIT5k
(3)
  • Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT).有四个流行的基准数据集用于场景文本识别的性能评估,即ICDAR 2003(IC03),ICDAR 2013(IC13),IIIT 5k-word(IIIT5k)和Street View Text (SVT)。
  • IIIT5k [28] contains 3,000 cropped word test images collected from the Internet.IIIT5k[28]包含从互联网收集的3000张裁剪的词测试图像。
  • Specifically, we obtain superior performance on IIIT5k and SVT compared to [22], only achieving lower performance on IC03 with the “Full” lexicon.具体来说,与[22]相比,我们在IIIT5k和SVT上获得了更优的性能,仅在IC03上使用“Full”词典时性能较低。
32 rectangular
(3)
[rek'tæŋɡjələ(r)]
  • In the 3rd and the 4th max-pooling layers, we adopt 1 × 2 sized rectangular pooling windows instead of the conventional squared ones.在第3和第4个最大池化层中,我们采用1×2大小的矩形池化窗口,而不是传统的正方形窗口。
  • On top of that, the rectangular pooling windows yield rectangular receptive fields (illustrated in Fig. 2), which are beneficial for recognizing some characters that have narrow shapes, such as ’i’ and ’l’.此外,矩形池化窗口产生矩形感受野(如图2所示),这有助于识别一些具有窄形状的字符,例如'i'和'l'。
  • On top of that, the rectangular pooling windows yield rectangular receptive fields (illustrated in Fig. 2), which are beneficial for recognizing some characters that have narrow shapes, such as ’i’ and ’l’.此外,矩形池化窗口产生矩形感受野(如图2所示),这有助于识别一些具有窄形状的字符,例如'i'和'l'。
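A hedged PyTorch illustration of the pooling tweak, assuming the 1 × 2 window pools along the height axis only (torch kernel (2, 1)); the exact strides and paddings in Table 1 may differ. Compared with a square 2 × 2 window, the width, and hence the length of the feature sequence, is preserved:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 4, 26)        # (batch, channels, height, width)

square = nn.MaxPool2d(kernel_size=2, stride=2)           # conventional 2x2
rect = nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1))   # pools height only

print(square(x).shape)  # torch.Size([1, 512, 2, 13]) -- 13-frame sequence
print(rect(x).shape)    # torch.Size([1, 512, 2, 26]) -- width kept: 26 frames,
                        # and each column's receptive field becomes tall and
                        # narrow, which suits thin characters like 'i' and 'l'
```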
33 character-level
(3)
[!≈ ˈkærəktə(r) ˈlevl]
  • Our method uses only synthetic text with word level labels as the training data, very different to PhotoOCR [8] which used 7.9 millions of real word images with character-level annotations for training.我们的方法只使用具有单词级标签的合成文本作为训练数据,与PhotoOCR[8]非常不同,后者使用790万个具有字符级标注的真实单词图像进行训练。
  • CharGT-Free: This column is to indicate whether the character-level annotations are essential for training the model.CharGT-Free:这一列用来表明字符级标注对于训练模型是否是必要的。
  • As the input and output labels of CRNN can be a sequence, character-level annotations are not necessary.由于CRNN的输入和输出标签是序列,因此字符级标注是不必要的。
34 E2E
(3)
  • For further understanding the advantages of the proposed algorithm over other text recognition approaches, we provide a comprehensive comparison on several properties named E2E Train, Conv Ftrs, CharGT-Free, Unconstrained, and Model Size, as summarized in Table 3.为了进一步了解与其它文本识别方法相比,所提出算法的优点,我们提供了在一些特性上的综合比较,这些特性名称为E2E Train,Conv Ftrs,CharGT-Free,Unconstrained和Model Size,如表3所示。
  • Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions).比较的属性包括:1)端到端训练(E2E Train);2)从图像中直接学习卷积特征而不是使用手动设计的特征(Conv Ftrs);3)训练期间不需要字符的实际边界框(CharGT-Free);4)不受限于预定义字典(Unconstrained);5)模型大小(如果使用端到端模型),通过模型参数数量来衡量(Model Size, M表示百万)。
  • E2E Train: This column is to show whether a certain text reading model is end-to-end trainable, without any preprocess or through several separated steps, which indicates such approaches are elegant and clean for training.E2E Train:这一列是为了显示某种文字阅读模型是否可以进行端到端的训练,无需任何预处理或经过几个分离的步骤,这表明这种方法对于训练是优雅且干净的。
35 Ftrs
(3)
  • For further understanding the advantages of the proposed algorithm over other text recognition approaches, we provide a comprehensive comparison on several properties named E2E Train, Conv Ftrs, CharGT-Free, Unconstrained, and Model Size, as summarized in Table 3.为了进一步了解与其它文本识别方法相比,所提出算法的优点,我们提供了在一些特性上的综合比较,这些特性名称为E2E Train,Conv Ftrs,CharGT-Free,Unconstrained和Model Size,如表3所示。
  • Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions).比较的属性包括:1)端到端训练(E2E Train);2)从图像中直接学习卷积特征而不是使用手动设计的特征(Conv Ftrs);3)训练期间不需要字符的实际边界框(CharGT-Free);4)不受限于预定义字典(Unconstrained);5)模型大小(如果使用端到端模型),通过模型参数数量来衡量(Model Size, M表示百万)。
  • Conv Ftrs: This column is to indicate whether an approach uses the convolutional features learned from training images directly or handcraft features as the basic representations.Conv Ftrs:这一列用来表明一个方法是否使用从训练图像直接学习到的卷积特征或手动特征作为基本的表示。
36 CharGT-Free
(3)
  • For further understanding the advantages of the proposed algorithm over other text recognition approaches, we provide a comprehensive comparison on several properties named E2E Train, Conv Ftrs, CharGT-Free, Unconstrained, and Model Size, as summarized in Table 3.为了进一步了解与其它文本识别方法相比,所提出算法的优点,我们提供了在一些特性上的综合比较,这些特性名称为E2E Train,Conv Ftrs,CharGT-Free,Unconstrained和Model Size,如表3所示。
  • Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions).比较的属性包括:1)端到端训练(E2E Train);2)从图像中直接学习卷积特征而不是使用手动设计的特征(Conv Ftrs);3)训练期间不需要字符的实际边界框(CharGT-Free);4)不受限于预定义字典(Unconstrained);5)模型大小(如果使用端到端模型),通过模型参数数量来衡量(Model Size, M表示百万)。
  • CharGT-Free: This column is to indicate whether the character-level annotations are essential for training the model.CharGT-Free:这一列用来表明字符级标注对于训练模型是否是必要的。
37 Capella
(3)
[kəˈpelə]
  • For comparison, we evaluate two commercial OMR engines, namely the Capella Scan [3] and the PhotoScore [4].为了比较,我们评估了两种商用OMR引擎,即Capella Scan[3]和PhotoScore[4]。
  • The CRNN outperforms the two commercial systems by a large margin. The Capella Scan and PhotoScore systems perform reasonably well on the Clean dataset, but their performances drop significantly on synthesized and real-world data.CRNN大幅优于这两个商业系统。Capella Scan和PhotoScore系统在干净的数据集上表现相当不错,但是它们的性能在合成数据和现实世界数据上显著下降。
  • Compared with Capella Scan and PhotoScore, our CRNN-based system is still preliminary and misses many functionalities.与Capella Scan和PhotoScore相比,我们的基于CRNN的系统仍然是初步的,并且缺少许多功能。
38 PhotoScore
(3)
  • For comparison, we evaluate two commercial OMR engines, namely the Capella Scan [3] and the PhotoScore [4].为了比较,我们评估了两种商用OMR引擎,即Capella Scan[3]和PhotoScore[4]。
  • The CRNN outperforms the two commercial systems by a large margin. The Capella Scan and PhotoScore systems perform reasonably well on the Clean dataset, but their performances drop significantly on synthesized and real-world data.CRNN大幅优于这两个商业系统。Capella Scan和PhotoScore系统在干净的数据集上表现相当不错,但是它们的性能在合成数据和现实世界数据上显著下降。
  • Compared with Capella Scan and PhotoScore, our CRNN-based system is still preliminary and misses many functionalities.与Capella Scan和PhotoScore相比,我们的基于CRNN的系统仍然是初步的,并且缺少许多功能。
39 extraction
(2)
[ɪkˈstrækʃn]
  • A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed.提出了一种将特征提取,序列建模和转录整合到统一框架中的新型神经网络架构。
  • 2.1. Feature Sequence Extraction2.1. 特征序列提取
40 arbitrary
(2)
[ˈɑ:bɪtrəri]
  • (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization.(2)它自然地处理任意长度的序列,不涉及字符分割或水平尺度归一化。
  • Thirdly, RNN is able to operate on sequences of arbitrary lengths, traversing from starts to ends.第三,RNN能够从头到尾对任意长度的序列进行操作。
41 jointly
(2)
[dʒɔɪntlɪ]
  • Though CRNN is composed of different kinds of network architectures (e.g. CNN and RNN), it can be jointly trained with one loss function.虽然CRNN由不同类型的网络架构(如CNN和RNN)组成,但可以通过一个损失函数进行联合训练。
  • Secondly, RNN can back-propagate error differentials to its input, i.e. the convolutional layer, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network.其次,RNN可以将误差反向传播到它的输入,即卷积层,从而允许我们在统一的网络中联合训练循环层和卷积层。
42 invariant
(2)
[ɪnˈveəriənt]
  • As the layers of convolution, max-pooling, and element-wise activation function operate on local regions, they are translation invariant.由于卷积层,最大池化层和元素激活函数在局部区域上执行,因此它们是平移不变的。
  • In CRNN, we convey deep features into sequential representations in order to be invariant to the length variation of sequence-like objects.在CRNN中,我们将深度特征传递到序列表示中,以便对类序列对象的长度变化保持不变。
43 capability
(2)
[ˌkeɪpəˈbɪləti]
  • Firstly, RNN has a strong capability of capturing contextual information within a sequence.首先,RNN具有很强的捕获序列内上下文信息的能力。
  • But it provides a new scheme for OMR, and has shown promising capabilities in pitch recognition.但它为OMR提供了一个新的方案,并且在音高识别方面表现出有前途的能力。
44 propagate
(2)
[ˈprɒpəgeɪt]
  • In recurrent layers, error differentials are propagated in the opposite directions of the arrows shown in Fig. 3.b, i.e. Back-Propagation Through Time (BPTT).在循环层中,误差在图3.b所示箭头的相反方向传播,即随时间反向传播(BPTT)。
  • At the bottom of the recurrent layers, the sequence of propagated differentials are concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers.在循环层的底部,传播差异的序列被连接成映射,将特征映射转换为特征序列的操作进行反转并反馈到卷积层。
45 BPTT
(2)
[!≈ bi: pi: ti: ti:]
  • In recurrent layers, error differentials are propagated in the opposite directions of the arrows shown in Fig. 3.b, i.e. Back-Propagation Through Time (BPTT).在循环层中,误差在图3.b所示箭头的相反方向传播,即随时间反向传播(BPTT)。
  • In the recurrent layers, the Back-Propagation Through Time (BPTT) is applied to calculate the error differentials.在循环层中,应用随时间反向传播(BPTT)来计算误差。
46 log-likelihood
(2)
[!≈ lɒg ˈlaɪklihʊd]
  • Consequently, when we use the negative log-likelihood of this probability as the objective to train the network, we only need images and their corresponding label sequences, avoiding the labor of labeling positions of individual characters.因此,当我们使用这种概率的负对数似然作为训练网络的目标函数时,我们只需要图像及其相应的标签序列,避免了标注单个字符位置的劳动。
  • The objective is to minimize the negative log-likelihood of conditional probability of ground truth:目标是最小化真实条件概率的负对数似然:
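A hedged sketch of this objective in PyTorch (an assumption; the paper's own transcription layer is custom Torch7/C++): `torch.nn.CTCLoss` computes the negative log-likelihood $-\log p(\mathbf{l}|\mathbf{y})$ from per-frame predictions and sequence-level labels alone, with no per-character positions:

```python
import torch
import torch.nn as nn

T, N, C = 26, 2, 37                 # frames, batch, classes (incl. blank = 0)
log_probs = torch.randn(T, N, C).log_softmax(dim=2)

# Sequence-level labels only, concatenated across the batch; no positions.
targets = torch.tensor([3, 7, 7, 12, 5, 9, 1])   # two words of lengths 4 and 3
target_lengths = torch.tensor([4, 3])
input_lengths = torch.full((N,), T)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss)   # batch-reduced negative log-likelihood -log p(l|y)
```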
47 stamp
(2)
  • where the probability of $\boldsymbol{\pi}$ is defined as $p(\boldsymbol{\pi}|\mathbf{y})=\prod_{t=1}^{T}y_{\pi_{t}}^{t}$, and $y_{\pi_{t}}^{t}$ is the probability of having label $\pi_{t}$ at time stamp $t$.$\boldsymbol{\pi}$的概率定义为$p(\boldsymbol{\pi}|\mathbf{y})=\prod_{t=1}^{T}y_{\pi_{t}}^{t}$,其中$y_{\pi_{t}}^{t}$是在时刻$t$具有标签$\pi_{t}$的概率。
  • The sequence $\mathbf{l}^{*}$ is approximately found by $\mathbf{l}^{*}\approx{\cal B}(\arg\max_{\boldsymbol{\pi}}p(\boldsymbol{\pi}|\mathbf{y}))$, i.e. taking the most probable label $\pi_{t}$ at each time stamp $t$, and mapping the resulting sequence onto $\mathbf{l}^{*}$.序列$\mathbf{l}^{*}$通过$\mathbf{l}^{*}\approx{\cal B}(\arg\max_{\boldsymbol{\pi}}p(\boldsymbol{\pi}|\mathbf{y}))$近似得到,即在每个时刻$t$取概率最大的标签$\pi_{t}$,并将所得序列映射到$\mathbf{l}^{*}$。
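A minimal sketch of this lexicon-free approximation: take the most probable label at each time stamp, then apply ${\cal B}$ (collapse repeats, drop blanks):

```python
import numpy as np

BLANK = 0
ALPHABET = "-abcdefghijklmnopqrstuvwxyz"   # index 0 is the blank

def best_path_decode(y):
    """l* ~ B(argmax_pi p(pi|y)): per-frame argmax, then collapse."""
    path = y.argmax(axis=1)                # most probable label per time stamp
    out, prev = [], None
    for p in path:
        if p != prev and p != BLANK:
            out.append(ALPHABET[p])
        prev = p
    return "".join(out)

# A toy posterior matrix that peaks along the path "hh-el-lo":
T, C = 8, len(ALPHABET)
y = np.full((T, C), 1e-3)
for t, ch in enumerate("hh-el-lo"):
    y[t, ALPHABET.index(ch)] = 1.0
print(best_path_decode(y))   # -> "hello"
```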
48 forward-backward
(2)
[!≈ ˈfɔ:wəd ˈbækwəd]
  • However, Eq.1 can be efficiently computed using the forward-backward algorithm described in [15].然而,使用[15]中描述的前向-后向算法可以有效计算方程1。
  • In particular, in the transcription layer, error differentials are back-propagated with the forward-backward algorithm, as described in [15].特别地,在转录层中,如[15]所述,误差使用前向-后向算法进行反向传播。
49 Hunspell
(2)
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Eq.1.基本上,通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列,即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而,对于大型词典,例如5万词的Hunspell拼写检查字典[1],对整个词典进行穷举搜索,即计算方程1,将非常耗时。
  • In addition, we use a 50k words lexicon consisting of the words in the Hunspell spell-checking dictionary [1].此外,我们使用由Hunspell拼写检查字典[1]中的单词组成的5万个词的词典。
50 spell-checking
(2)
[!≈ spel 'tʃekɪŋ]
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Eq.1.基本上,通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列,即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而,对于大型词典,例如5万词的Hunspell拼写检查字典[1],对整个词典进行穷举搜索,即计算方程1,将非常耗时。
  • In addition, we use a 50k words lexicon consisting of the words in the Hunspell spell-checking dictionary [1].此外,我们使用由Hunspell拼写检查字典[1]中的单词组成的5万个词的词典。
51 momentum
(2)
[məˈmentəm]
  • Compared with the conventional momentum [31] method, ADADELTA requires no manual setting of a learning rate.与传统的动量[31]方法相比,ADADELTA不需要手动设置学习率。
  • More importantly, we find that optimization using ADADELTA converges faster than the momentum method.更重要的是,我们发现使用ADADELTA的优化收敛速度比动量方法快。
52 bounding
(2)
[baundɪŋ]
  • IC03 [27] test dataset contains 251 scene images with labeled text bounding boxes.IC03[27]测试数据集包含251个具有标记文本边界框的场景图像。
  • Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions).比较的属性包括:1)端到端训练(E2E Train);2)从图像中直接学习卷积特征而不是使用手动设计的特征(Conv Ftrs);3)训练期间不需要字符的实际边界框(CharGT-Free);4)不受限于预定义字典(Unconstrained);5)模型大小(如果使用端到端模型),通过模型参数数量来衡量(Model Size, M表示百万)。
53 tweak
(2)
[twi:k]
  • A tweak is made in order to make it suitable for recognizing English texts.为了使其适用于识别英文文本,对其进行了调整。
  • This tweak yields feature maps with larger width, hence longer feature sequence.这种调整产生宽度较大的特征图,因此具有更长的特征序列。
54 RAM
(2)
[ræm]
  • Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU.实验在具有2.50 GHz Intel(R)Xeon E5-2609 CPU,64GB RAM和NVIDIA(R)Tesla(TM) K40 GPU的工作站上进行。
  • Our model has 8.3 million parameters, taking only 33MB RAM (using 4-bytes single-precision float for each parameter), thus it can be easily ported to mobile devices.我们的模型有830万个参数,只有33MB RAM(每个参数使用4字节单精度浮点数),因此可以轻松地移植到移动设备上。
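The 33MB figure follows directly from the parameter count (assuming decimal megabytes):

```python
params = 8.3e6             # 8.3 million parameters
bytes_total = params * 4   # 4-byte single-precision floats
print(bytes_total / 1e6)   # 33.2 -> ~33 MB, small enough for mobile devices
```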
55 Tab
(2)
[tæb]
  • Different from the configuration specified in Tab. 1, the 4th and 6th convolution layers are removed, and the 2-layer bidirectional LSTM is replaced by a 2-layer single directional LSTM.与表1中指定的配置不同,我们移除了第4和第6卷积层,将2层双向LSTM替换为2层单向LSTM。
  • Tab. 4 summarizes the results.表4总结了结果。
56 long-standing
(1)
[ˈlɔŋstædiŋ]
  • Image-based sequence recognition has been a long-standing research topic in computer vision.基于图像的序列识别一直是计算机视觉中长期存在的研究课题。
57 predefined
(1)
[pri:dɪ'faɪnd]
  • (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks.(3)它不仅限于任何预定义的词汇,并且在无词典和基于词典的场景文本识别任务中都取得了显著的表现。
58 scenario
(1)
[səˈnɑ:riəʊ]
  • (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios.(4)它产生了一个有效而小得多的模型,这对于现实世界的应用场景更为实用。
59 revival
(1)
[rɪˈvaɪvl]
  • Recently, the community has seen a strong revival of neural networks, which is mainly stimulated by the great success of deep neural network models, specifically Deep Convolutional Neural Networks (DCNN), in various vision tasks.最近,社区已经看到神经网络的强大复兴,这主要受到深度神经网络模型,特别是深度卷积神经网络(DCNN)在各种视觉任务中的巨大成功的推动。
60 drastically
(1)
['drɑ:stɪklɪ]
  • Another unique property of sequence-like objects is that their lengths may vary drastically.类序列对象的另一个独特之处在于它们的长度可能会有很大变化。
61 congratulations
(1)
[kənˌgrætjʊ'leɪʃənz]
  • For instance, English words can either consist of 2 characters such as “OK” or 15 characters such as “congratulations”.例如,英文单词可以由2个字符组成,如“OK”,或由15个字符组成,如“congratulations”。
62 incapable
(1)
[ɪnˈkeɪpəbl]
  • Consequently, the most popular deep models like DCNN [25, 26] cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions, and thus are incapable of producing a variable-length label sequence.因此,最流行的深度模型像DCNN[25,26]不能直接应用于序列预测,因为DCNN模型通常对具有固定维度的输入和输出进行操作,因此不能产生可变长度的标签序列。
63 variable-length
(1)
['veərɪəbll'eŋθ]
  • Consequently, the most popular deep models like DCNN [25, 26] cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions, and thus are incapable of producing a variable-length label sequence.因此,最流行的深度模型像DCNN[25,26]不能直接应用于序列预测,因为DCNN模型通常对具有固定维度的输入和输出进行操作,因此不能产生可变长度的标签序列。
64 generalized
(1)
[ˈdʒenrəlaɪzd]
  • It turns out a large trained model with a huge number of classes, which is difficult to generalize to other types of sequence-like objects, such as Chinese texts, musical scores, etc., because the number of basic combinations of such kind of sequences can be greater than 1 million.这会产生一个具有大量类别的大型训练模型,很难泛化到其它类型的类序列对象,如中文文本、乐谱等,因为这类序列的基本组合数目可能超过100万。
65 geometrical
(1)
[ˌdʒi:ə'metrɪkl]
  • For example, Graves et al. [16] extract a set of geometrical or image features from handwritten texts, while Su and Lu [33] convert word images into sequential HOG features.例如,Graves等[16]从手写文本中提取一系列几何或图像特征,而Su和Lu[33]将单词图像转换为序列HOG特征。
66 HOG
(1)
[hɒg]
  • For example, Graves et al. [16] extract a set of geometrical or image features from handwritten texts, while Su and Lu [33] convert word images into sequential HOG features.例如,Graves等[16]从手写文本中提取一系列几何或图像特征,而Su和Lu[33]将单词图像转换为序列HOG特征。
67 insightful
(1)
[ˈɪnsaɪtfʊl]
  • Several conventional scene text recognition methods that are not based on neural networks also brought insightful ideas and novel representations into this field.一些不基于神经网络的传统场景文本识别方法也为这一领域带来了有见地的想法和新颖的表示。
68 Almazan
(1)
  • For example, Almazan et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, and word recognition is converted into a retrieval problem.例如,Almazan等人[5]和Rodriguez-Serrano等人[30]提出将单词图像和文本字符串嵌入到公共向量子空间中,并将词识别转换为检索问题。
69 Rodriguez-Serrano
(1)
  • For example, Almazan et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, and word recognition is converted into a retrieval problem.例如,Almazan等人[5]和Rodriguez-Serrano等人[30]提出将单词图像和文本字符串嵌入到公共向量子空间中,并将词识别转换为检索问题。
70 vectorial
(1)
[vek'tɒrɪəl]
  • For example, Almazan et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, and word recognition is converted into a retrieval problem.例如,Almazan等人[5]和Rodriguez-Serrano等人[30]提出将单词图像和文本字符串嵌入到公共向量子空间中,并将词识别转换为检索问题。
71 retrieval
(1)
[rɪˈtri:vl]
  • For example, Almazan et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, and word recognition is converted into a retrieval problem.例如,Almazan等人[5]和Rodriguez-Serrano等人[30]提出将单词图像和文本字符串嵌入到公共向量子空间中,并将词识别转换为检索问题。
72 Gordo
(1)
  • Yao et al. [36] and Gordo et al. [14] used mid-level features for scene text recognition.Yao等人[36]和Gordo等人[14]使用中层特征进行场景文本识别。
73 informative
(1)
[ɪnˈfɔ:mətɪv]
  • 2) It has the same property of DCNN on learning informative representations directly from image data, requiring neither hand-craft features nor preprocessing steps, including binarization/segmentation, component localization, etc.;2)直接从图像数据学习信息表示时具有与DCNN相同的性质,既不需要手工特征也不需要预处理步骤,包括二值化/分割,组件定位等;
74 hand-craft
(1)
['hæn(d)krɑːft]
  • 2) It has the same property of DCNN on learning informative representations directly from image data, requiring neither hand-craft features nor preprocessing steps, including binarization/segmentation, component localization, etc.;2)直接从图像数据学习信息表示时具有与DCNN相同的性质,既不需要手工特征也不需要预处理步骤,包括二值化/分割,组件定位等;
75 outputted
(1)
['aʊt.pʊt]
  • On top of the convolutional network, a recurrent network is built for making prediction for each frame of the feature sequence, outputted by the convolutional layers.在卷积网络之上,构建了一个循环网络,用于对卷积层输出的特征序列的每一帧进行预测。
76 e.g.
(1)
  • Though CRNN is composed of different kinds of network architectures (e.g. CNN and RNN), it can be jointly trained with one loss function.虽然CRNN由不同类型的网络架构(如CNN和RNN)组成,但可以通过一个损失函数进行联合训练。
77 concatenation
(1)
[kənˌkætəˈneɪʃn]
  • This means the i-th feature vector is the concatenation of the i-th columns of all the maps.这意味着第i个特征向量是所有特征图第i列的连接。
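A minimal PyTorch sketch of this map-to-sequence step (shapes are illustrative): column $i$ across all $C$ maps of height $H$ becomes one frame of length $C\cdot H$, and the left-to-right order is preserved:

```python
import torch

feature_maps = torch.randn(1, 512, 1, 26)   # (batch, channels, height, width)

N, C, H, W = feature_maps.shape
# Column i across all maps -> one frame of the feature sequence.
seq = feature_maps.permute(3, 0, 1, 2).reshape(W, N, C * H)
print(seq.shape)   # torch.Size([26, 1, 512]) -- 26 frames for the recurrent layers
```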
78 descriptor
(1)
[dɪˈskrɪptə(r)]
  • As illustrated in Fig. 2, each vector in the feature sequence is associated with a receptive field, and can be considered as the image descriptor for that region.如图2所示,特征序列中的每个向量关联一个感受野,并且可以被认为是该区域的图像描述符。
79 holistic
(1)
[həʊˈlɪstɪk]
  • However, these approaches usually extract holistic representation of the whole image by CNN, then the local deep features are collected for recognizing each component of a sequence-like object.然而,这些方法通常通过CNN提取整个图像的整体表示,然后收集局部深度特征来识别类序列对象的每个分量。
80 cue
(1)
[kju:]
  • Using contextual cues for image-based sequence recognition is more stable and helpful than treating each symbol independently.对于基于图像的序列识别使用上下文提示比独立处理每个符号更稳定且更有帮助。
81 il
(1)
  • Besides, some ambiguous characters are easier to distinguish when observing their contexts, e.g. it is easier to recognize “il” by contrasting the character heights than by recognizing each of them separately.此外,一些模糊的字符在观察其上下文时更容易区分,例如,通过对比字符高度更容易识别“il”而不是分别识别它们中的每一个。
82 back-propagate
(1)
[!≈ bæk ˈprɒpəgeɪt]
  • Secondly, RNN can back-propagate error differentials to its input, i.e. the convolutional layer, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network.其次,RNN可以将误差反向传播到它的输入,即卷积层,从而允许我们在统一的网络中联合训练循环层和卷积层。
83 traverse
(1)
[trəˈvɜ:s]
  • Thirdly, RNN is able to operate on sequences of arbitrary lengths, traversing from starts to ends.第三,RNN能够从头到尾对任意长度的序列进行操作。
84 Long-Short
(1)
[!≈ lɒŋ ʃɔ:t]
  • Long-Short Term Memory (LSTM) [18, 11] is a type of RNN unit that is specially designed to address this problem.长短时记忆(LSTM)[18,11]是一种专门设计用于解决这个问题的RNN单元。
85 multiplicative
(1)
['mʌltɪplɪkeɪtɪv]
  • An LSTM (illustrated in Fig. 3) consists of a memory cell and three multiplicative gates, namely the input, output and forget gates.LSTM(图3所示)由一个存储单元和三个多重门组成,即输入,输出和遗忘门。
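A minimal NumPy sketch of one LSTM step, showing the memory cell and the three multiplicative gates (weights are random placeholders, not a trained model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step: three multiplicative gates around one memory cell."""
    z = W @ np.concatenate([x, h]) + b   # all four projections at once
    d = len(h)
    i = sigmoid(z[0*d:1*d])   # input gate: what enters the cell
    f = sigmoid(z[1*d:2*d])   # forget gate: what the cell erases
    o = sigmoid(z[2*d:3*d])   # output gate: what the cell reveals
    g = np.tanh(z[3*d:4*d])   # candidate content
    c = f * c + i * g         # memory cell carries long-range context
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
dx, dh = 8, 4
W, b = rng.normal(size=(4 * dh, dx + dh)), np.zeros(4 * dh)
h, c = np.zeros(dh), np.zeros(dh)
for t in range(5):            # unroll over a short input sequence
    h, c = lstm_step(rng.normal(size=dx), h, c, W, b)
print(h)
```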
86 Conceptually
(1)
[kən'septʃʊəlɪ]
  • Conceptually, the memory cell stores the past contexts, and the input and output gates allow the cell to store contexts for a long period of time.在概念上,存储单元存储过去的上下文,并且输入和输出门允许单元长时间地存储上下文。
87 long-range
(1)
[lɒŋ reɪndʒ]
  • The special design of LSTM allows it to capture long-range dependencies, which often occur in image-based sequences.LSTM的特殊设计允许它捕获长距离依赖,这经常发生在基于图像的序列中。
88 complementary
(1)
[ˌkɒmplɪˈmentri]
  • However, in image-based sequences, contexts from both directions are useful and complementary to each other.然而,在基于图像的序列中,两个方向的上下文是相互有用且互补的。
89 concatenate
(1)
[kɒn'kætɪneɪt]
  • At the bottom of the recurrent layers, the sequence of propagated differentials is concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers.在循环层的底部,传播的误差序列被连接成特征图,即将特征图转换为特征序列的操作反转,然后反馈给卷积层。
90 invert
(1)
[ɪnˈvɜ:t]
  • At the bottom of the recurrent layers, the sequence of propagated differentials is concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers.在循环层的底部,传播的误差序列被连接成特征图,即将特征图转换为特征序列的操作反转,然后反馈给卷积层。
91 Mathematically
(1)
[ˌmæθə'mætɪklɪ]
  • Mathematically, transcription is to find the label sequence with the highest probability conditioned on the per-frame predictions.数学上,转录是根据每帧预测找到具有最高概率的标签序列。
92 Connectionist
(1)
[kə'nekʃənɪst]
  • We adopt the conditional probability defined in the Connectionist Temporal Classification (CTC) layer proposed by Graves et al. [15].我们采用Graves等人[15]提出的联接时间分类(CTC)层中定义的条件概率。
93 CTC
(1)
[!≈ si: ti: si:]
  • We adopt the conditional probability defined in the Connectionist Temporal Classification (CTC) layer proposed by Graves et al. [15].我们采用Graves等人[15]提出的联接时间分类(CTC)层中定义的条件概率。
94 hh-e-l-ll-oo
(1)
  • For example, ${\cal B}$ maps “–hh-e-l-ll-oo–” (’-’ represents ’blank’) onto “hello”.例如,${\cal B}$将“–hh-e-l-ll-oo–”(-表示blank)映射到“hello”。
95 computationally
(1)
[!≈ ˌkɒmpjuˈteɪʃənli]
  • Directly computing Eq.1 would be computationally infeasible due to the exponentially large number of summation items.由于存在指数级数量的求和项,直接计算方程1在计算上是不可行的。
96 infeasible
(1)
[ɪn'fi:zəbl]
  • Directly computing Eq.1 would be computationally infeasible due to the exponentially large number of summation items.由于存在指数级数量的求和项,直接计算方程1在计算上是不可行的。
97 exponentially
(1)
[ˌekspə'nenʃəlɪ]
  • Directly computing Eq.1 would be computationally infeasible due to the exponentially large number of summation items.由于存在指数级数量的求和项,直接计算方程1在计算上是不可行的。
98 summation
(1)
[sʌˈmeɪʃn]
  • Directly computing Eq.1 would be computationally infeasible due to the exponentially large number of summation items.由于存在指数级数量的求和项,直接计算方程1在计算上是不可行的。
99 tractable
(1)
[ˈtræktəbl]
  • Since there exists no tractable algorithm to precisely find the solution, we use the strategy adopted in [15].由于不存在用于精确找到解的可行方法,我们采用[15]中的策略。
100 time-consuming
(1)
[taɪm kən'sju:mɪŋ]
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Eq.1.基本上,通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列,即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而,对于大型词典,例如5万词的Hunspell拼写检查字典[1],对整个词典进行穷举搜索,即计算方程1,将非常耗时。
101 exhaustive
(1)
[ɪgˈzɔ:stɪv]
  • Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Eq.1.基本上,通过选择词典中具有方程1中定义的最高条件概率的序列来识别标签序列,即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而,对于大型词典,例如5万词的Hunspell拼写检查字典[1],对整个词典进行穷举搜索,即计算方程1,将非常耗时。
102 nearest-neighbor
(1)
['nɪərɪstn'eɪbɔ:]
  • This indicates that we can limit our search to the nearest-neighbor candidates ${\cal N}_{\delta}(\mathbf{l}’)$, where $\delta$ is the maximal edit distance and $\mathbf{l}’$ is the sequence transcribed from $\mathbf{y}$ in lexicon-free mode:这表示我们可以将搜索限制在最近邻候选目标${\cal N}_{\delta}(\mathbf{l}’)$,其中$\delta$是最大编辑距离,$\mathbf{l}’$是在无词典模式下从$\mathbf{y}$转录的序列:
103 maximal
(1)
[ˈmæksɪml]
  • This indicates that we can limit our search to the nearest-neighbor candidates ${\cal N}_{\delta}(\mathbf{l}’)$, where $\delta$ is the maximal edit distance and $\mathbf{l}’$ is the sequence transcribed from $\mathbf{y}$ in lexicon-free mode:这表示我们可以将搜索限制在最近邻候选目标${\cal N}_{\delta}(\mathbf{l}’)$,其中$\delta$是最大编辑距离,$\mathbf{l}’$是在无词典模式下从$\mathbf{y}$转录的序列:
104 transcribe
(1)
[trænˈskraɪb]
  • This indicates that we can limit our search to the nearest-neighbor candidates ${\cal N}_{\delta}(\mathbf{l}’)$, where $\delta$ is the maximal edit distance and $\mathbf{l}’$ is the sequence transcribed from $\mathbf{y}$ in lexicon-free mode:这表示我们可以将搜索限制在最近邻候选目标${\cal N}_{\delta}(\mathbf{l}’)$,其中$\delta$是最大编辑距离,$\mathbf{l}’$是在无词典模式下从$\mathbf{y}$转录的序列:
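A brute-force rendering of the candidate set ${\cal N}_{\delta}(\mathbf{l}')$ (the BK-tree sketched earlier is the efficient way to build it); the transcription is then the candidate with the highest $p(\mathbf{l}|\mathbf{y})$, e.g. scored with the forward recursion shown above:

```python
def edit_distance(a, b):
    """Classic DP edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

lexicon = ["hello", "help", "hull", "yellow", "world"]
l_prime = "helo"   # the lexicon-free transcription of y
delta = 3

candidates = [l for l in lexicon if edit_distance(l, l_prime) <= delta]
print(candidates)
# The final transcription is argmax over these candidates of p(l|y),
# so only a handful of sequences need to be scored instead of the whole lexicon.
```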
105 stochastic
(1)
[stə'kæstɪk]
  • The network is trained with stochastic gradient descent (SGD).网络使用随机梯度下降(SGD)进行训练。
106 descent
(1)
[dɪˈsent]
  • The network is trained with stochastic gradient descent (SGD).网络使用随机梯度下降(SGD)进行训练。
107 SGD
(1)
['esdʒ'i:d'i:]
  • The network is trained with stochastic gradient descent (SGD).网络使用随机梯度下降(SGD)进行训练。
108 back-propagated
(1)
[!≈ bæk ˈprɔpəɡeitid]
  • In particular, in the transcription layer, error differentials are back-propagated with the forward-backward algorithm, as described in [15].特别地,在转录层中,如[15]所述,误差使用前向-后向算法进行反向传播。
109 Synth
(1)
[sɪnθ]
  • For all the experiments for scene text recognition, we use the synthetic dataset (Synth) released by Jaderberg et al. [20] as the training data.对于场景文本识别的所有实验,我们使用Jaderberg等人[20]发布的合成数据集(Synth)作为训练数据。
110 Jaderberg
(1)
  • For all the experiments for scene text recognition, we use the synthetic dataset (Synth) released by Jaderberg et al. [20] as the training data.对于场景文本识别的所有实验,我们使用Jaderberg等人[20]发布的合成数据集(Synth)作为训练数据。
111 IIIT
(1)
[!≈ aɪ aɪ aɪ ti:]
  • Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT).有四个流行的基准数据集用于场景文本识别的性能评估,即ICDAR 2003(IC03),ICDAR 2013(IC13),IIIT 5k-word(IIIT5k)和Street View Text (SVT)。
112 non-alphanumeric
(1)
[!≈ nɒn ˌælfənju:ˈmerɪk]
  • Following Wang et al. [34], we ignore images that either contain non-alphanumeric characters or have less than three characters, and get a test set with 860 cropped text images.按照Wang等人[34]的做法,我们忽略包含非字母数字字符或少于三个字符的图像,得到一个具有860张裁剪文本图像的测试集。
113 VGG-VeryDeep
(1)
  • The architecture of the convolutional layers is based on the VGG-VeryDeep architectures [32].卷积层的架构是基于VGG-VeryDeep的架构[32]。
114 Xeon
(1)
  • Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU.实验在具有2.50 GHz Intel(R)Xeon E5-2609 CPU,64GB RAM和NVIDIA(R)Tesla(TM) K40 GPU的工作站上进行。
115 NVIDIA
(1)
[ɪn'vɪdɪə]
  • Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU.实验在具有2.50 GHz Intel(R)Xeon E5-2609 CPU,64GB RAM和NVIDIA(R)Tesla(TM) K40 GPU的工作站上进行。
116 Tesla
(1)
['teslә]
  • Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU.实验在具有2.50 GHz Intel(R)Xeon E5-2609 CPU,64GB RAM和NVIDIA(R)Tesla(TM) K40 GPU的工作站上进行。
117 TM
(1)
[!≈ ti: em]
  • Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU.实验在具有2.50 GHz Intel(R)Xeon E5-2609 CPU,64GB RAM和NVIDIA(R)Tesla(TM) K40 GPU的工作站上进行。
118 K40
(1)
  • Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU.实验在具有2.50 GHz Intel(R)Xeon E5-2609 CPU,64GB RAM和NVIDIA(R)Tesla(TM) K40 GPU的工作站上进行。
119 convergence
(1)
[kən'vɜ:dʒəns]
  • The training process takes about 50 hours to reach convergence.训练过程大约需要50个小时才能达到收敛。
120 proportionally
(1)
[prə'pɔ:ʃənlɪ]
  • Widths are proportionally scaled with heights, but at least 100 pixels.宽度与高度成比例地缩放,但至少为100像素。
121 Comparative
(1)
[kəmˈpærətɪv]
  • 3.3. Comparative Evaluation3.3. 比较评估
122 consistently
(1)
[kən'sɪstəntlɪ]
  • In the constrained lexicon cases, our method consistently outperforms most state-of-the-art approaches, and on average beats the best text reader proposed in [22].在有约束词典的情况中,我们的方法始终优于大多数最新的方法,并且平均性能超过了[22]中提出的最佳文本阅读器。
123 PhotoOCR
(1)
  • Our method uses only synthetic text with word level labels as the training data, very different to PhotoOCR [8] which used 7.9 millions of real word images with character-level annotations for training.我们的方法只使用具有单词级标签的合成文本作为训练数据,与PhotoOCR[8]非常不同,后者使用790万个具有字符级标注的真实单词图像进行训练。
124 performance
(1)
  • The best performance is reported by [22] in the unconstrained lexicon cases, benefiting from its large dictionary; however, it is not a model strictly unconstrained to a lexicon as mentioned before.在无约束词典的情况下,[22]报告了最佳性能,这受益于它的大字典;然而,如前所述,它并不是严格意义上不受词典约束的模型。
125 hand-crafted
(1)
[,hænd 'kra:ftid]
  • Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions).比较的属性包括:1)端到端训练(E2E Train);2)从图像中直接学习卷积特征而不是使用手动设计的特征(Conv Ftrs);3)训练期间不需要字符的实际边界框(CharGT-Free);4)不受限于预定义字典(Unconstrained);5)模型大小(如果使用端到端模型),通过模型参数数量来衡量(Model Size, M表示百万)。
126 handcraft
(1)
[ˈhændkrɑ:ft]
  • Conv Ftrs: This column is to indicate whether an approach uses the convolutional features learned from training images directly or handcraft features as the basic representations.Conv Ftrs:这一列用来表明一个方法是否使用从训练图像直接学习到的卷积特征或手动特征作为基本的表示。
127 incremental
(1)
[ˌɪŋkrə'mentl]
  • Notice that though the recent models learned by label embedding [5, 14] and incremental learning [22] achieved highly competitive performance, they are constrained to a specific dictionary.注意尽管最近通过标签嵌入[5, 14]和增量学习[22]学习到的模型取得了非常有竞争力的性能,但它们受限于一个特定的字典。
128 weight-sharing
(1)
[!≈ weɪt 'ʃeərɪŋ]
  • In CRNN, all layers have weight-sharing connections, and the fully-connected layers are not needed.在CRNN中,所有的层有权重共享连接,不需要全连接层。
129 variant
(1)
[ˈveəriənt]
  • Consequently, the number of parameters of CRNN is much less than the models learned on the variants of CNN [22, 21], resulting in a much smaller model compared with [22, 21].因此,CRNN的参数数量远小于CNN变体[22,21]所得到的模型,导致与[22,21]相比,模型要小得多。
130 MB
(1)
[!≈ em bi:]
  • Our model has 8.3 million parameters, taking only 33MB RAM (using 4-bytes single-precision float for each parameter), thus it can be easily ported to mobile devices.我们的模型有830万个参数,只有33MB RAM(每个参数使用4字节单精度浮点数),因此可以轻松地移植到移动设备上。
131 Eq.2.
(1)
  • In addition, to test the impact of parameter $\delta$, we experiment with different values of $\delta$ in Eq. 2.另外,为了测试参数$\delta$的影响,我们在方程2中实验了$\delta$的不同值。
132 tradeoff
(1)
['treɪdˌɔ:f]
  • In practice, we choose $\delta=3$ as a tradeoff between accuracy and speed.实际上,我们选择$\delta=3$作为精度和速度之间的折衷。
133 binarization
(1)
  • Previous methods often require image preprocessing (mostly binarization), staff lines detection and individual notes recognition [29].以前的方法通常需要图像预处理(主要是二值化),五线谱检测和单个音符识别[29]。
134 note pitches
(1)
  • We manually label the ground truth label sequences (sequences of note pitches) for all the images.我们手动标注所有图像的真实标签序列(即音符音高的序列)。
135 augment
(1)
[ɔ:gˈment]
  • The collected images are augmented to 265k training samples by being rotated, scaled and corrupted with noise, and by replacing their backgrounds with natural images.收集到的图像通过旋转、缩放、加入噪声损坏以及用自然图像替换背景,被增强至265k个训练样本。
136 augmentation
(1)
[ˌɔ:ɡmen'teɪʃn]
  • Examples are shown in Fig. 5.a; 2) “Synthesized”, which is created from “Clean”, using the augmentation strategy mentioned above.实例如图5.a所示;2)“合成的”,由“纯净的”创建,使用了上述的增强策略。
137 clutter
(1)
[ˈklʌtə(r)]
  • The main reason is that they rely on robust binarization to detect staff lines and notes, but the binarization step often fails on synthesized and real-world data due to bad lighting condition, noise corruption and cluttered background.主要原因是它们依赖于鲁棒的二值化来检测五线谱和音符,但是由于光线不良,噪音破坏和杂乱的背景,二值化步骤经常会在合成数据和现实数据上失败。
138 minimal
(1)
[ˈmɪnɪməl]
  • The results have shown the generality of CRNN, in that it can be readily applied to other image-based sequence recognition problems, requiring minimal domain knowledge.结果显示了CRNN的泛化性,因为它可以很容易地应用于其它的基于图像的序列识别问题,需要极少的领域知识。
139 preliminary
(1)
[prɪˈlɪmɪnəri]
  • Compared with Capella Scan and PhotoScore, our CRNN-based system is still preliminary and misses many functionalities.与Capella Scan和PhotoScore相比,我们的基于CRNN的系统仍然是初步的,并且缺少许多功能。
140 functionality
(1)
[ˌfʌŋkʃəˈnæləti]
  • Compared with Capella Scan and PhotoScore, our CRNN-based system is still preliminary and misses many functionalities.与Capella Scan和PhotoScore相比,我们的基于CRNN的系统仍然是初步的,并且缺少许多功能。